Abstract

Support vector machines (SVMs) naturally embody sparseness due to their use of hinge 
loss functions. However, SVMs cannot directly estimate conditional class probabilities. In
this paper we propose and study a family of coherence functions, which are convex and 
differentiable, as surrogates of the hinge function. The coherence function is derived by 
using the maximum-entropy principle and is characterized by a temperature parameter. 
It bridges the hinge function and the logit function in logistic regression. The limit of 
the coherence function at zero temperature corresponds to the hinge function, and the 
limit of the minimizer of its expected error is the minimizer of the expected error of the 
hinge loss. We refer to the use of the coherence function in large-margin classification 
as "C-learning" and we present efficient coordinate descent algorithms for the training of
regularized C-learning models. 

Keywords: Large-margin classifiers; Hinge functions; Logistic Functions; Coherence 
functions; Class predictive probability; C-learning. 



1. Introduction 

Large-margin classification methods have become increasingly popular since the advent of
boosting (Freund, 1995), support vector machines (SVM) (Vapnik, 1998) and their variants
such as $\psi$-learning (Shen et al., 2003). Large-margin classification methods are typically
devised based on a majorization-minimization procedure, which approximately solves an
otherwise intractable optimization problem defined with the 0-1 loss. For example, the
conventional SVM employs a hinge loss, the AdaBoost algorithm employs the exponential
loss, and $\psi$-learning employs a so-called $\psi$-loss, as majorizations of the 0-1 loss.

Large-margin classification methods can be unified using the tools of regularization
theory; that is, they can be expressed as the form of "loss" + "penalty" (Hastie et al., 2001).
Sparseness has also emerged as a significant theme generally associated with large-margin
methods. Typical approaches for achieving sparseness are to use either a non-differentiable
penalty or a non-differentiable loss. Recent developments in the former vein focus on the
use of an $\ell_1$ penalty (Tibshirani, 1996) or the elastic-net penalty (a mixture of the $\ell_1$ and
$\ell_2$ penalties) (Zou and Hastie, 2005) instead of the $\ell_2$ penalty which is typically used in
large-margin classification methods. As for non-differentiable losses, the paradigm case is
the hinge loss function that is used for the SVM and which leads to a sparse expansion of
the discriminant function.

Unfortunately, the conventional SVM does not directly estimate a conditional class 
probability. Thus, the conventional SVM is unable to provide estimates of uncertainty in 
its predictions — an important desideratum in real-world applications. Moreover, the non- 
differentiability of the hinge loss also makes it difficult to extend the conventional SVM to 
multi-class classification problems. Thus, one seemingly natural approach to constructing 
a classifier for the binary and multi-class problems is to consider a smooth loss function, 
while an appropriate penalty is employed to maintain the sparseness of the classifier. For 
example, regularized logistic regression models based on logit losses (Friedman et al., 2010) 
are competitive with SVMs. 

Of crucial concern are the statistical properties (Lin, 2002, Bartlett et al., 2006, Zhang, 
2004) of the majorization function for the original 0-1 loss function. In particular, we 
analyze the statistical properties of extant majorization functions, which are built on the 
exponential, logit and hinge functions. This analysis inspires us to propose a new ma- 
jorization function, which we call a coherence function due to a connection with statistical 
mechanics. We also define a loss function that we refer to as C-loss based on the coherence
function. 

The C-loss is smooth and convex, and it satisfies a Fisher-consistency condition — a desir- 
able statistical property (Bartlett et al., 2006, Zhang, 2004). The C-loss has the advantage 
over the hinge loss that it provides an estimate of the conditional class probability, and over 
the logit loss that one limiting version of it is just the hinge loss. Thus, the C-loss as well 
as the coherence function have several desirable properties in the context of large-margin
classifiers. 

In this paper we show how the coherence function can be used to develop an effective 
approach to estimating the class probability of the conventional binary SVM. Platt (1999)
first exploited a sigmoid link function to map the SVM outputs into probabilities, while
Sollich (2002) used logarithmic scoring rules (Bernardo and Smith, 1994) to transform the
hinge loss into the negative of a conditional log-likelihood (i.e., a predictive class probabil- 
ity). Recently, Wang et al. (2008) developed an interval estimation method. Theoretically, 
Steinwart (2003) and Bartlett and Tewari (2007) showed that the class probability can be 
asymptotically estimated by replacing the hinge loss with a differentiable loss. Our ap- 
proach also appeals to asymptotics to derive a method for estimating the class probability 
of the conventional binary SVM. 

Using the C-loss, we devise new large-margin classifiers which we refer to as C-learning. 
To maintain sparseness, we use the elastic-net penalty in addition to C-learning. We in 
particular propose two versions. The first version is based on reproducing kernel Hilbert 
spaces (RKHSs) and it can automatically select the number of support vectors via penal- 
ization. The second version focuses on the selection of features again via penalization. The 
classifiers are trained by coordinate descent algorithms developed by Friedman et al. (2010)
for generalized linear models. 

The rest of this paper is organized as follows. In Section 2 we summarize the fundamental 
basis of large-margin classification. Section 3 presents C-loss functions, their mathematical 
properties, and a method for class probability estimation of the conventional SVM. Section 4 
studies our C-learning algorithms. We conduct an experimental analysis in Section 5 and 
conclude our work in Section 6. All proofs are deferred to the Appendix. 

2. Large-Margin Classifiers 

We consider a binary classification problem with a set of training data $T = \{x_i, y_i\}_{i=1}^{n}$, where
$x_i \in \mathcal{X} \subset \mathbb{R}^d$ is an input vector and $y_i \in \mathcal{Y} = \{1, -1\}$ is the corresponding class label. Our
goal is to find a decision function $f(x)$ over a measurable function class $\mathcal{F}$. Once such an
$f(x)$ is obtained, the classification rule is $y = \mathrm{sign}(f(x))$ where $\mathrm{sign}(a) = 1, 0, -1$ according
to $a > 0$, $a = 0$ or $a < 0$. Thus, we have that $x$ is misclassified if and only if $y f(x) \le 0$
(here we ignore the case that $f(x) = 0$).

Let $\eta(x) = \Pr(Y = 1 \mid X = x)$ be the conditional probability of class 1 given $x$ and let
$P(X, Y)$ be the probability distribution over $\mathcal{X} \times \mathcal{Y}$. For a measurable decision function
$f(x) : \mathcal{X} \to \mathbb{R}$, the expected error at $x$ is then defined by

$$\Psi(f(x)) = E(I_{[Y f(X) \le 0]} \mid X = x) = I_{[f(x) \le 0]}\,\eta(x) + I_{[f(x) > 0]}\,(1 - \eta(x)),$$

where $I_{[\#]} = 1$ if $\#$ is true and $0$ otherwise. The generalization error is

$$\Psi_f = E_P\, I_{[Y f(X) \le 0]} = E_X\big[ I_{[f(X) \le 0]}\,\eta(X) + I_{[f(X) > 0]}\,(1 - \eta(X)) \big],$$

where the expectation $E_P$ is taken with respect to the distribution $P(X, Y)$ and $E_X$ denotes
the expectation over the input data $\mathcal{X}$. The optimal Bayes error is $\hat{\Psi} = E_P\, I_{[Y(2\eta(X) - 1) \le 0]}$,
which is the minimum of $\Psi_f$ with respect to measurable functions $f$.

A classifier is a classification algorithm which finds a measurable function $f_T : \mathcal{X} \to \mathbb{R}$
based on the training data $T$. We assume that the $(x_i, y_i)$ in $T$ are independent and
identically distributed from $P(X, Y)$. A classifier is said to be universally consistent if

$$\lim_{n \to \infty} \Psi_{f_T} = \hat{\Psi}$$

holds in probability for any distribution $P$ on $\mathcal{X} \times \mathcal{Y}$. It is strongly universally consistent if
the condition $\lim_{n \to \infty} \Psi_{f_T} = \hat{\Psi}$ is satisfied almost surely (Steinwart, 2005).
The empirical generalization error on the training data $T$ is given by

$$\Psi_{emp} = \frac{1}{n} \sum_{i=1}^{n} I_{[y_i f(x_i) \le 0]}.$$

Given that the empirical generalization error $\Psi_{emp}$ is equal to its minimum value zero when
all training data are correctly classified, we wish to use $\Psi_{emp}$ as a basis for devising
classification algorithms. However, the corresponding minimization problem is computationally
intractable.
Table 1: Surrogate losses for margin-based classifiers.

    Loss                    Expression
    Exponential Loss        $\exp[-y f(x)/2]$
    Logit Loss              $\log[1 + \exp(-y f(x))]$
    Hinge Loss              $[1 - y f(x)]_+$
    Squared Hinge Loss      $([1 - y f(x)]_+)^2$



2.1 Surrogate Losses 

A wide variety of classifiers can be understood as minimizers of a continuous surrogate loss
function $\phi(y f(x))$, which upper bounds the 0-1 loss $I_{[y f(x) \le 0]}$. Corresponding to $\Psi(f(x))$
and $\Psi_f$, we denote $R(f(x)) = \phi(f(x))\eta(x) + \phi(-f(x))(1 - \eta(x))$ and

$$R_f = E_P[\phi(Y f(X))] = E_X\big[\phi(f(X))\eta(X) + \phi(-f(X))(1 - \eta(X))\big].$$

For convenience, we assume that $\eta \in [0, 1]$ and define the notation

$$R(\eta, f) = \eta\,\phi(f) + (1 - \eta)\,\phi(-f).$$

The surrogate $\phi$ is said to be Fisher consistent, if for every $\eta \in [0, 1]$ the minimizer of
$R(\eta, f)$ with respect to $f$ exists and is unique and the minimizer (denoted $\hat{f}(\eta)$) satisfies
$\mathrm{sign}(\hat{f}(\eta)) = \mathrm{sign}(\eta - 1/2)$ (Lin, 2002, Bartlett et al., 2006, Zhang, 2004). Since $\mathrm{sign}(u) = 0$
is equivalent to $u = 0$, we have that $\hat{f}(1/2) = 0$. Substituting $\hat{f}(\eta)$ into $R(\eta, f)$, we also
define the following notation:

$$\hat{R}(\eta) = \inf_{f} R(\eta, f) = R(\eta, \hat{f}(\eta)).$$

The difference between $R(\eta, f)$ and $\hat{R}(\eta)$ is

$$\Delta R(\eta, f) = R(\eta, f) - \hat{R}(\eta) = R(\eta, f) - R(\eta, \hat{f}(\eta)).$$

When regarding $f(x)$ and $\eta(x)$ as functions of $x$, it is clear that $\hat{f}(\eta(x))$ is the minimizer
of $R(f(x))$ over the measurable function class $\mathcal{F}$. That is,

$$\hat{f}(\eta(x)) = \arg\min_{f(x) \in \mathcal{F}} R(f(x)).$$

In this setting, the difference between $R_f$ and $E_X[R(\hat{f}(\eta(X)))]$ (denoted $\hat{R}_f$) is given by

$$\Delta R_f = R_f - \hat{R}_f = E_X\, \Delta R(\eta(X), f(X)).$$

If $\hat{f}(\eta)$ is invertible, then the inverse function $\hat{f}^{-1}(f)$ over $\mathcal{F}$ can be regarded as a
class-conditional probability estimate given that $\eta(x) = \hat{f}^{-1}(\hat{f}(x))$. Moreover, Zhang (2004)
showed that $\Delta R_f$ is the expected distance between the conditional probability $\hat{f}^{-1}(f(x))$
and the true conditional probability $\eta(x)$. Thus, minimizing $R_f$ is equivalent to minimizing
the expected distance between $\hat{f}^{-1}(f(x))$ and $\eta(x)$.

Table 1 lists four common surrogate functions used in large-margin classifiers. Here
$[u]_+ = \max\{u, 0\}$ is a so-called hinge function and $([u]_+)^2 = (\max\{u, 0\})^2$ is a squared
hinge function which is used for developing the $\ell_2$-SVM (Cristianini and Shawe-Taylor,
2000). Note that we typically scale the logit loss to equal 1 at $y f(x) = 0$. These functions
are convex and the upper bounds of $I_{[y f(x) \le 0]}$. Moreover, they are Fisher consistent. In
particular, the following result has been established by Friedman et al. (2000) and Lin
(2002).

Proposition 1 Assume that $0 < \eta(x) < 1$ and $\eta(x) \neq 1/2$. Then, the minimizers of
$E(\exp[-Y f(X)/2] \mid X = x)$ and $E(\log[1 + \exp(-Y f(X))] \mid X = x)$ are both $\hat{f}(x) = \log\frac{\eta(x)}{1 - \eta(x)}$,
the minimizer of $E([1 - Y f(X)]_+ \mid X = x)$ is $\hat{f}(x) = \mathrm{sign}(\eta(x) - 1/2)$, and the minimizer of
$E(([1 - Y f(X)]_+)^2 \mid X = x)$ is $\hat{f}(x) = 2\eta(x) - 1$.

When the exponential or logit loss function is used, $\hat{f}^{-1}(f(x))$ exists. It is clear that
$\eta(x) = \hat{f}^{-1}(\hat{f}(x))$. For any $f(x) \in \mathcal{F}$, we denote the inverse function by $\tilde{\eta}(x)$, which is

$$\tilde{\eta}(x) = \hat{f}^{-1}(f(x)) = \frac{1}{1 + \exp(-f(x))}.$$

Unfortunately, the minimization of the hinge loss (which is the basis of the SVM) does not
yield a class probability estimate (Lin et al., 2002).

2.2 The Regularization Approach 

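Given a surrogate loss function $\phi$, a large-margin classifier typically solves the following
optimization problem:

$$\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \phi(y_i f(x_i)) + \gamma J(h), \tag{1}$$

where $f(x) = \alpha + h(x)$, $J(h)$ is a regularization term to penalize model complexity and $\gamma$
is the degree of penalization.

Suppose that $f = \alpha + h \in (\{1\} + \mathcal{H}_K)$ where $\mathcal{H}_K$ is a reproducing kernel Hilbert
space (RKHS) induced by a reproducing kernel $K(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Finding $f(x)$ is then
formulated as a regularization problem of the form

$$\min_{f} \; \frac{1}{n} \sum_{i=1}^{n} \phi(y_i f(x_i)) + \frac{\gamma}{2} \|h\|_{\mathcal{H}_K}^2, \tag{2}$$

where $\|h\|_{\mathcal{H}_K}$ is the RKHS norm. By the representer theorem (Wahba, 1990), the solution
of (2) is of the form

$$f(x_i) = \alpha + \sum_{j=1}^{n} \beta_j K(x_i, x_j) = \alpha + k_i'\beta, \tag{3}$$

where $\beta = (\beta_1, \ldots, \beta_n)'$ and $k_i = (K(x_i, x_1), \ldots, K(x_i, x_n))'$. Noticing that $\|h\|_{\mathcal{H}_K}^2 =
\sum_{i,j=1}^{n} K(x_i, x_j)\beta_i\beta_j$ and substituting (3) into (2), we obtain the minimization problem
with respect to $\alpha$ and $\beta$ as

$$\min_{\alpha, \beta} \; \frac{1}{n} \sum_{i=1}^{n} \phi(y_i(\alpha + k_i'\beta)) + \frac{\gamma}{2} \beta' K \beta,$$

where $K = [k_1, \ldots, k_n]$ is the $n \times n$ kernel matrix. Since $K$ is symmetric and positive
semidefinite, the term $\beta' K \beta$ is in fact an empirical RKHS norm on the training data.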

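In particular, the conventional SVM defines the surrogate $\phi(\cdot)$ as the hinge loss and
solves the following optimization problem:

$$\min_{\alpha, \beta} \; \frac{1}{n} \sum_{i=1}^{n} [1 - y_i(\alpha + k_i'\beta)]_+ + \frac{\gamma}{2} \beta' K \beta. \tag{4}$$
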

In this paper, we are especially interested in universal kernels, namely, kernels whose 
induced RKHS is dense in the space of continuous functions over X (Steinwart, 2001). The 
Gaussian RBF kernel is such an example. 
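To make the kernel expansion concrete, the sketch below builds a Gaussian RBF kernel matrix, evaluates the decision function $f(x) = \alpha + \sum_j \beta_j K(x, x_j)$, and computes the penalty term $\beta' K \beta$. The function names, the bandwidth parameter `sigma`, and the toy points are assumptions chosen for illustration; the text does not prescribe a particular parameterization of the kernel.

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel exp(-||x - z||^2 / (2 sigma^2)); sigma is an assumed bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_matrix(points, sigma=1.0):
    """n x n matrix K with entries K(x_i, x_j)."""
    return [[rbf_kernel(p, q, sigma) for q in points] for p in points]

def decision_value(x, points, alpha, beta, sigma=1.0):
    """Kernel expansion f(x) = alpha + sum_j beta_j K(x, x_j)."""
    return alpha + sum(b * rbf_kernel(x, p, sigma) for b, p in zip(beta, points))

def rkhs_penalty(K, beta):
    """Quadratic form beta' K beta, the empirical RKHS norm on the training data."""
    n = len(beta)
    return sum(beta[i] * K[i][j] * beta[j] for i in range(n) for j in range(n))

# Toy data, purely illustrative.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
alpha, beta = 0.1, [0.5, -1.0, 0.25]

K = kernel_matrix(points)
penalty = rkhs_penalty(K, beta)
f_first = decision_value(points[0], points, alpha, beta)
print(penalty, f_first)
```

Because $K$ is positive semidefinite, `penalty` is nonnegative for any choice of `beta`, and the decision value at a training point $x_i$ equals $\alpha + k_i'\beta$ as in (3).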



2.3 Methods for Class Probability Estimation of SVMs 

Let $\hat{f}(x)$ be the solution of the SVM problem in (4). In an attempt to address the problem of
class probability estimation for SVMs, Sollich (2002) proposed a class probability estimate

$$\tilde{\eta}(x) = \begin{cases} \dfrac{1}{1 + \exp(-2\hat{f}(x))} & \text{if } |\hat{f}(x)| \le 1, \\[2ex] \dfrac{1}{1 + \exp[-(\hat{f}(x) + \mathrm{sign}(\hat{f}(x)))]} & \text{otherwise.} \end{cases}$$

This class probability was also used in the derivation of a so-called complete SVM by Mallick
et al. (2005).

Another proposal for obtaining class probabilities from SVM outputs was developed by
Platt (1999), who employed a post-processing procedure based on the parametric formula

$$\tilde{\eta}(x) = \frac{1}{1 + \exp(A \hat{f}(x) + B)},$$

where the parameters $A$ and $B$ are estimated via the minimization of the empirical
cross-entropy error over the training dataset.
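As an illustration of this post-processing step, the sketch below fits $A$ and $B$ by plain gradient descent on the empirical cross-entropy. This is a simplified stand-in chosen for brevity, not the optimizer used by Platt (1999); the function names, learning rate, iteration count, and toy SVM outputs are all assumptions.

```python
import math

def platt_probability(score, A, B):
    """1 / (1 + exp(A * score + B)), evaluated in a numerically stable way."""
    s = A * score + B
    if s >= 0:
        e = math.exp(-s)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(s))

def cross_entropy(scores, labels, A, B, eps=1e-12):
    """Empirical cross-entropy with targets t = (y + 1) / 2 for labels y in {-1, +1}."""
    total = 0.0
    for f, y in zip(scores, labels):
        p = platt_probability(f, A, B)
        t = (y + 1) / 2
        total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total / len(scores)

def fit_platt(scores, labels, lr=0.5, iters=5000):
    """Gradient descent on the cross-entropy (a swapped-in, simple optimizer)."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(iters):
        grad_A = grad_B = 0.0
        for f, y in zip(scores, labels):
            t = (y + 1) / 2
            residual = t - platt_probability(f, A, B)  # d(loss)/d(A f + B)
            grad_A += residual * f / n
            grad_B += residual / n
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B

# Hypothetical SVM outputs with two overlapping points so that the optimum is finite.
scores = [-2.0, -1.5, -1.0, -0.5, 0.3, -0.2, 0.5, 1.0, 1.5, 2.0]
labels = [-1, -1, -1, -1, -1, 1, 1, 1, 1, 1]
A, B = fit_platt(scores, labels)
print(A, B, cross_entropy(scores, labels, A, B))
```

With this sign convention a useful fit has $A < 0$, so that larger SVM outputs map to larger estimated probabilities of class 1.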

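Wang et al. (2008) proposed a nonparametric form obtained from training a sequence
of weighted classifiers:

$$\min_{f} \; \frac{1}{n} \Big\{ (1 - \pi_j) \sum_{y_i = 1} [1 - y_i f(x_i)]_+ + \pi_j \sum_{y_i = -1} [1 - y_i f(x_i)]_+ \Big\} + \gamma J(h) \tag{5}$$

for $j = 1, \ldots, m+1$ such that $0 = \pi_1 < \cdots < \pi_{m+1} = 1$. Let $\hat{f}_{\pi_j}(x)$ be the solution
of (5). The estimated class probability is then $\tilde{\eta}(x) = \frac{1}{2}(\pi_* + \pi^*)$ where $\pi^* = \min\{\pi_j :
\mathrm{sign}(\hat{f}_{\pi_j}(x)) = -1\}$ and $\pi_* = \max\{\pi_j : \mathrm{sign}(\hat{f}_{\pi_j}(x)) = 1\}$.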

Additional contributions are due to Steinwart (2003) and Bartlett and Tewari (2007). 
These authors showed that the class probability can be asymptotically estimated by replac- 
ing the hinge loss with various differentiable losses. 



3. Coherence Functions

In this section we present a smooth and Fisher-consistent majorization loss, which bridges
the hinge loss and the logit loss. We will see that one limit of this loss is equal to the
hinge loss. Thus, it is applicable to the asymptotic estimation of the class probability for
the conventional SVM as well as the construction of margin-based classifiers, which will be
presented in Section 3.5 and Section 4.



3.1 Definition 

Under the 0-1 loss the misclassification costs are specified to be one, but it is natural to
set the misclassification costs to be a positive constant $u > 0$. The empirical generalization
error on the training data is given in this case by

$$\frac{1}{n} \sum_{i=1}^{n} u\, I_{[y_i f(x_i) \le 0]},$$

where $u > 0$ is a constant that represents the misclassification cost. In this setting we can
extend the hinge loss as

$$H_u(y f(x)) = [u - y f(x)]_+. \tag{6}$$

It is clear that $H_u(y f(x)) \ge u\, I_{[y f(x) \le 0]}$. This implies that $H_u(y f(x))$ is a majorization of
$u\, I_{[y f(x) \le 0]}$.

We apply the maximum entropy principle to develop a smooth surrogate of the hinge
loss $[u - z]_+$. In particular, noting that $[u - z]_+ = \max\{u - z, 0\}$, we maximize $w(u - z)$ with
respect to $w \in (0, 1)$ under the entropy constraint; that is,

$$\max_{w \in (0,1)} \Big\{ F = w(u - z) - \rho\big[ w \log w + (1 - w) \log(1 - w) \big] \Big\},$$

where $-[w \log w + (1 - w) \log(1 - w)]$ is the entropy and $\rho > 0$, a Lagrange multiplier, plays
the role of temperature in thermodynamics.
The maximum of $F$ is

$$V_{\rho,u}(z) = \rho \log\Big[ 1 + \exp\frac{u - z}{\rho} \Big] \tag{7}$$

at $w = \exp((u - z)/\rho) / [1 + \exp((u - z)/\rho)]$. We refer to functions of this form as coherence
functions because their properties (detailed in the next subsection) are similar to statistical
mechanical properties of deterministic annealing (Rose et al., 1990).
We also consider a scaled variant of $V_{\rho,u}(z)$:

$$C_{\rho,u}(z) = \frac{u}{\log[1 + \exp(u/\rho)]} \log\Big[ 1 + \exp\frac{u - z}{\rho} \Big], \quad \rho > 0, \; u > 0, \tag{8}$$

which has the property that $C_{\rho,u}(z) = u$ when $z = 0$. Recall that $u$ as a misclassification
cost should be specified as a positive value. However, both $C_{\rho,0}(z)$ and $V_{\rho,0}(z)$ are well
defined mathematically. Since $C_{\rho,0}(z) = 0$ is a trivial case, we always assume that $u > 0$ for
$C_{\rho,u}(z)$ here and later. In the binary classification problem, $z$ is defined as $y f(x)$. In the
special case that $u = 1$, $C_{\rho,1}(y f(x))$ can be regarded as a smooth alternative to the SVM
hinge loss $[1 - y f(x)]_+$. We refer to $C_{\rho,u}(y f(x))$ as C-losses.

It is worth noting that $V_{1,0}(z)$ is the logistic function and $V_{\rho,0}(z)$ has been proposed
by Zhang and Oles (2001) for binary logistic regression. We keep in mind that $u \ge 0$ for
$V_{\rho,u}(z)$ through this paper.
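The definitions above translate directly into code. The sketch below implements $V_{\rho,u}$ and $C_{\rho,u}$ using a numerically stable evaluation of $\log(1 + e^{a})$, and checks on one example that the maximum of the entropy-regularized objective $F$ coincides with $V_{\rho,u}(z)$. The function names and the example values of $z$, $\rho$ and $u$ are assumptions made for illustration.

```python
import math

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def coherence_v(z, rho, u=1.0):
    """V_{rho,u}(z) = rho * log(1 + exp((u - z) / rho)), Equation (7)."""
    return rho * softplus((u - z) / rho)

def c_loss(z, rho, u=1.0):
    """C_{rho,u}(z): V rescaled so that C(0) = u, Equation (8); requires u > 0."""
    return u * softplus((u - z) / rho) / softplus(u / rho)

def entropy_objective(w, z, rho, u=1.0):
    """F(w) = w (u - z) - rho [w log w + (1 - w) log(1 - w)] for w in (0, 1)."""
    return w * (u - z) - rho * (w * math.log(w) + (1.0 - w) * math.log(1.0 - w))

z, rho, u = 0.3, 0.5, 1.0                      # illustrative values
w_star = 1.0 / (1.0 + math.exp(-(u - z) / rho))  # maximizer stated in the text
grid_max = max(entropy_objective(k / 1000.0, z, rho, u) for k in range(1, 1000))

print(entropy_objective(w_star, z, rho, u), coherence_v(z, rho, u), grid_max)
print(c_loss(0.0, rho, u))                     # equals u by construction
print(coherence_v(z, 1e-3, u))                 # close to the hinge value [u - z]_+
```

Writing both functions through `softplus` keeps the evaluation finite even for very small temperatures, where $\exp((u - z)/\rho)$ would overflow if computed directly.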
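3.2 Properties

It is obvious that $C_{\rho,u}(z)$ and $V_{\rho,u}(z)$ are infinitely smooth with respect to $z$. Moreover, the
first-order and second-order derivatives of $C_{\rho,u}(z)$ with respect to $z$ are given as

$$C'_{\rho,u}(z) = -\frac{u}{\rho \log[1 + \exp(u/\rho)]} \, \frac{\exp\frac{u - z}{\rho}}{1 + \exp\frac{u - z}{\rho}},$$

$$C''_{\rho,u}(z) = \frac{u}{\rho^2 \log[1 + \exp(u/\rho)]} \, \frac{\exp\frac{u - z}{\rho}}{\big(1 + \exp\frac{u - z}{\rho}\big)^2}.$$

Since $C''_{\rho,u}(z) > 0$ for any $z \in \mathbb{R}$, $C_{\rho,u}(z)$ as well as $V_{\rho,u}(z)$ are strictly convex in $z$, for fixed
$\rho > 0$ and $u > 0$.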

We now investigate relationships among the coherence functions and hinge losses. First, 
we have the following properties. 

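Proposition 2 Let $V_{\rho,u}(z)$ and $C_{\rho,u}(z)$ be defined by (7) and (8). Then,

(i) $u \times I_{[z \le 0]} \le [u - z]_+ \le V_{\rho,u}(z) \le \rho \log 2 + [u - z]_+$;

(ii) $\frac{1}{2}(u - z) \le V_{\rho,u}(z) - \rho \log 2$;

(iii) $\lim_{\rho \to 0} V_{\rho,u}(z) = [u - z]_+$ and $\lim_{\rho \to \infty} V_{\rho,u}(z) - \rho \log 2 = \frac{1}{2}(u - z)$;

(iv) $u \times I_{[z \le 0]} \le C_{\rho,u}(z) \le V_{\rho,u}(z)$;

(v) $\lim_{\rho \to 0} C_{\rho,u}(z) = [u - z]_+$ and $\lim_{\rho \to \infty} C_{\rho,u}(z) = u$, for $u > 0$.

As a special case of $u = 1$, we have $C_{\rho,1}(z) \ge I_{[z \le 0]}$. Moreover, $C_{\rho,1}(z)$ approaches
$(1 - z)_+$ as $\rho \to 0$. Thus, $C_{\rho,1}(z)$ is a majorization of $I_{[z \le 0]}$.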
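The inequalities and limits in Proposition 2 can be spot-checked on a grid. The following sketch is only a numerical sanity check under assumed grid values for $z$, $\rho$ and $u$, not a proof; the helper names are hypothetical.

```python
import math

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def hinge(z, u=1.0):
    """Extended hinge function [u - z]_+."""
    return max(u - z, 0.0)

def coherence_v(z, rho, u=1.0):
    return rho * softplus((u - z) / rho)

def c_loss(z, rho, u=1.0):
    return u * softplus((u - z) / rho) / softplus(u / rho)

def proposition_2_holds(u, rhos, zs, tol=1e-9):
    """Check parts (i), (ii) and (iv) of Proposition 2 on a finite grid."""
    log2 = math.log(2.0)
    for rho in rhos:
        for z in zs:
            cost = u if z <= 0 else 0.0   # u * I[z <= 0]
            h, v, c = hinge(z, u), coherence_v(z, rho, u), c_loss(z, rho, u)
            part_i = cost <= h + tol and h <= v + tol and v <= rho * log2 + h + tol
            part_ii = 0.5 * (u - z) <= v - rho * log2 + tol
            part_iv = cost <= c + tol and c <= v + tol
            if not (part_i and part_ii and part_iv):
                return False
    return True

zs = [k / 10.0 for k in range(-30, 31)]   # margins in [-3, 3]
rhos = [0.01, 0.1, 1.0, 2.0, 10.0]
print(proposition_2_holds(1.0, rhos, zs), proposition_2_holds(2.0, rhos, zs))

# Part (v): the two limits of the C-loss, approximated by extreme temperatures.
print(c_loss(0.25, 1e-3), hinge(0.25))    # rho -> 0 recovers the hinge value
print(c_loss(2.0, 1e6))                   # rho -> infinity approaches u = 1
```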

As we mentioned earlier, $V_{\rho,0}(z)$ are used to devise logistic regression models. We can
see from Proposition 2 that $V_{\rho,0}(z) \ge [-z]_+$, which implies that a logistic regression model
is possibly no longer a large-margin classifier. Interestingly, however, we consider a variant
of $V_{\rho,u}(z)$ as

$$L_{\rho,u}(z) = \frac{1}{\log(1 + \exp(u/\rho))} \log\big[ 1 + \exp((u - z)/\rho) \big], \quad \rho > 0, \; u \ge 0,$$

which always satisfies that $L_{\rho,u}(z) \ge I_{[z \le 0]}$ and $L_{\rho,u}(0) = 1$, for any $u \ge 0$. Thus, the
$L_{\rho,u}(z)$ for $\rho > 0$ and $u \ge 0$ are majorizations of $I_{[z \le 0]}$. In particular, $L_{\rho,1}(z) = C_{\rho,1}(z)$ and
$L_{1,0}(z)$ is the logit function.

In order to explore the relationship of $C_{\rho,u}(z)$ with $(u - z)_+$, we now consider some
properties of $L_{\rho,u}(z)$ when regarding it respectively as a function of $\rho$ and of $u$.

Proposition 3 Assume $\rho > 0$ and $u \ge 0$. Then,

(i) $L_{\rho,u}(z)$ is a decreasing function in $\rho$ if $z < 0$, and it is an increasing function in $\rho$ if
$z \ge 0$;

(ii) $L_{\rho,u}(z)$ is a decreasing function in $u$ if $z < 0$, and it is an increasing function in $u$ if
$z \ge 0$.

Results similar to those in Proposition 3-(i) also apply to $C_{\rho,u}(z)$ because of $C_{\rho,u}(z) =
u L_{\rho,u}(z)$. Then, according to Proposition 2-(v), we have that $u = \lim_{\rho \to +\infty} C_{\rho,u}(z) \le
C_{\rho,u}(z) \le \lim_{\rho \to 0} C_{\rho,u}(z) = (u - z)_+$ if $z < 0$ and $(u - z)_+ = \lim_{\rho \to 0} C_{\rho,u}(z) \le C_{\rho,u}(z) \le
\lim_{\rho \to +\infty} C_{\rho,u}(z) = u$ if $z \ge 0$. It follows from Proposition 3-(ii) that $C_{\rho,1}(z) = L_{\rho,1}(z) \le
L_{\rho,0}(z)$ if $z < 0$ and $C_{\rho,1}(z) = L_{\rho,1}(z) \ge L_{\rho,0}(z)$ if $z \ge 0$. In addition, it is easily seen that
$(1 - z)_+ \ge ((1 - z)_+)^2$ if $z \ge 0$ and $(1 - z)_+ \le ((1 - z)_+)^2$ otherwise. We now obtain the
following proposition:

Proposition 4 Assume $\rho > 0$. Then, $C_{\rho,1}(z) \le \min\{L_{\rho,0}(z), [1 - z]_+, ([1 - z]_+)^2\}$ if
$z < 0$, and $C_{\rho,1}(z) \ge \max\{L_{\rho,0}(z), [1 - z]_+, ([1 - z]_+)^2\}$ if $z \ge 0$.

This proposition is depicted in Figure 1. Owing to the relationships of the C-loss $C_{\rho,1}(y f(x))$
with the hinge and logit losses, it is potentially useful in devising new large-margin classifiers.

We now turn to the derivatives of $C_{\rho,u}(z)$ and $(u - z)_+$. It is immediately verified that
$-1 \le C'_{\rho,u}(z) \le 0$. Moreover, we have

$$\lim_{\rho \to 0} C'_{\rho,u}(z) = \lim_{\rho \to 0} V'_{\rho,u}(z) = \begin{cases} 0 & z > u, \\ -\frac{1}{2} & z = u, \\ -1 & z < u. \end{cases}$$

Note that $(u - z)'_+ = -1$ if $z < u$ and $(u - z)'_+ = 0$ if $z > u$. Furthermore, $\partial (u - z)_+ |_{z = u} =
[-1, 0]$ where $\partial (u - z)_+ |_{z = u}$ denotes the subdifferential of $(u - z)_+$ at $z = u$. Hence,

Proposition 5 For a fixed $u > 0$, we have that $\lim_{\rho \to 0} C'_{\rho,u}(z) \in \partial (u - z)_+$.

This proposition again establishes a connection of the hinge loss with the limit of
$C_{\rho,u}(z)$ at $\rho = 0$. Furthermore, we obtain from Propositions 2 and 5 that $\partial (u - z)_+ =
\partial \lim_{\rho \to 0} C_{\rho,u}(z) \ni \lim_{\rho \to 0} \partial C_{\rho,u}(z)$.
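The behaviour of the derivative as the temperature goes to zero can be illustrated numerically. The sketch below evaluates $C'_{\rho,u}(z) = -\frac{u}{\rho \log[1 + \exp(u/\rho)]}\,\frac{\exp((u-z)/\rho)}{1 + \exp((u-z)/\rho)}$ with a stable sigmoid, compares it against a central finite difference at a moderate temperature, and then evaluates it at a small $\rho$ on either side of the kink $z = u$. Function names and the chosen values of $\rho$, $z$ and the step size are assumptions for this example.

```python
import math

def sigmoid(a):
    """Numerically stable exp(a) / (1 + exp(a))."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def c_loss(z, rho, u=1.0):
    return u * softplus((u - z) / rho) / softplus(u / rho)

def c_loss_derivative(z, rho, u=1.0):
    """First derivative of the C-loss with respect to z."""
    return -u / (rho * softplus(u / rho)) * sigmoid((u - z) / rho)

# Consistency check against a central finite difference at a moderate temperature.
z0, rho0, step = 0.3, 0.5, 1e-6
finite_diff = (c_loss(z0 + step, rho0) - c_loss(z0 - step, rho0)) / (2.0 * step)
print(finite_diff, c_loss_derivative(z0, rho0))

# Near zero temperature the derivative approaches -1, -1/2 and 0 for z < u, z = u, z > u.
small_rho = 1e-3
limits = [c_loss_derivative(z, small_rho) for z in (0.0, 1.0, 2.0)]
print(limits)
```

All three limiting values lie in the subdifferential of the hinge function at the corresponding point, in line with Proposition 5.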

3.3 Consistency in Classification Methods 

We now apply the coherence function to the development of classification methods. Recall
that $C'_{\rho,u}(0)$ exists and is negative. Thus, the C-loss $C_{\rho,u}(y f(x))$ is Fisher-consistent (or
classification calibrated) (Bartlett et al., 2006). In particular, we have the following theorem.

Theorem 6 Assume $0 < \eta < 1$ and $\eta \neq \frac{1}{2}$. Consider the optimization problem

$$\min_{f \in \mathbb{R}} R(f, \eta) := V_{\rho,u}(f)\,\eta + V_{\rho,u}(-f)\,(1 - \eta)$$

for fixed $\rho > 0$ and $u \ge 0$. Then, the minimizer is unique and is given by

$$f_*(\eta) = \rho \log \frac{(2\eta - 1)\exp(\frac{u}{\rho}) + \sqrt{(1 - 2\eta)^2 \exp(\frac{2u}{\rho}) + 4\eta(1 - \eta)}}{2(1 - \eta)}. \tag{9}$$

Moreover, we have $f_* > 0$ if and only if $\eta > 1/2$. Additionally, the inverse function $f_*^{-1}(f)$
exists and it is given by

$$\tilde{\eta}(f) := f_*^{-1}(f) = \frac{1 + \exp(\frac{f - u}{\rho})}{1 + \exp(-\frac{u + f}{\rho}) + 1 + \exp(\frac{f - u}{\rho})}, \quad \text{for } f \in \mathbb{R}. \tag{10}$$

Figure 1: These functions are regarded as a function of $z = y f(x)$. (a) Coherence functions
$V_{\rho,1}(z)$ with $\rho = 0.01$, $\rho = 0.1$, $\rho = 1$ and $\rho = 2$. (b) A variety of majorization
loss functions, C-loss: $C_{1,1}(z)$; Logit loss: $L_{1,0}(z)$; Exponential loss: $\exp(-z/2)$;
Hinge loss: $[1 - z]_+$; Squared Hinge Loss: $([1 - z]_+)^2$. (c) $C_{\rho,1}(z)$ (or $L_{\rho,1}(z)$)
with $\rho = 0.1$, $\rho = 1$, $\rho = 10$ and $\rho = 100$ (see Proposition 3-(i)). (d) $L_{1,u}(z)$
with $u = 0$, $u = 0.1$, $u = 1$ and $u = 10$ (see Proposition 3-(ii)).


The minimizer $f_*(x)$ of $R(f(x)) := E(V_{\rho,u}(Y f(X)) \mid X = x)$ and its inverse $\tilde{\eta}(x)$ are
immediately obtained by replacing $f$ with $f(x)$ in (9) and (10). Since for $u > 0$ the
minimizers of $E(C_{\rho,u}(Y f(X)) \mid X = x)$ and $E(V_{\rho,u}(Y f(X)) \mid X = x)$ are the same, this theorem
shows that $C_{\rho,u}(y f(x))$ is also Fisher-consistent. We see from Theorem 6 that the explicit
expressions of $f_*(x)$ and its inverse $\tilde{\eta}(x)$ exist. In the special case that $u = 0$, we have
$f_*(x) = \rho \log \frac{\eta(x)}{1 - \eta(x)}$ and $\tilde{\eta}(x) = \frac{1}{1 + \exp(-f(x)/\rho)}$. Furthermore, when $\rho = 1$, as expected, we
recover logistic regression. In other words, the result is identical with that in Proposition 1
for logistic regression.
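The minimizer $f_*(\eta) = \rho \log \frac{(2\eta-1)e^{u/\rho} + \sqrt{(1-2\eta)^2 e^{2u/\rho} + 4\eta(1-\eta)}}{2(1-\eta)}$ and its inverse $\tilde{\eta}(f) = \frac{1 + e^{(f-u)/\rho}}{1 + e^{-(u+f)/\rho} + 1 + e^{(f-u)/\rho}}$ from Theorem 6 can be coded directly. The sketch below does so and checks, for assumed example values, that the two maps are inverse to each other and that $u = 0$ recovers the log-odds form. The direct formulas are adequate for moderate $u/\rho$; very small temperatures would need a more careful evaluation to avoid overflow and cancellation.

```python
import math

def f_star(eta, rho, u):
    """Minimizer f_*(eta) of eta * V(f) + (1 - eta) * V(-f), Equation (9)."""
    e = math.exp(u / rho)
    numerator = (2.0 * eta - 1.0) * e + math.sqrt(
        (1.0 - 2.0 * eta) ** 2 * e * e + 4.0 * eta * (1.0 - eta)
    )
    return rho * math.log(numerator / (2.0 * (1.0 - eta)))

def eta_tilde(f, rho, u):
    """Inverse map from a decision value f to a class probability, Equation (10)."""
    a = 1.0 + math.exp((f - u) / rho)
    b = 1.0 + math.exp(-(u + f) / rho)
    return a / (a + b)

rho, u = 0.5, 1.0                      # illustrative temperature and cost
etas = [0.1, 0.3, 0.7, 0.9]
round_trip = [eta_tilde(f_star(eta, rho, u), rho, u) for eta in etas]
print(round_trip)

# Special case u = 0: f_* reduces to rho * log(eta / (1 - eta)).
print(f_star(0.7, 2.0, 0.0), 2.0 * math.log(0.7 / 0.3))
```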

We further consider properties of $f_*(\eta)$. In particular, we have the following proposition.

Proposition 7 Let $f_*(\eta)$ be defined by (9). Then,

(i) $\mathrm{sign}(f_*(\eta)) = \mathrm{sign}(\eta - 1/2)$.

(ii) $\lim_{\rho \to 0} f_*(\eta) = u \times \mathrm{sign}(\eta - 1/2)$.

(iii) $f_*'(\eta) = \frac{d f_*(\eta)}{d\eta} \ge \frac{\rho}{\eta(1 - \eta)}$ with equality if and only if $u = 0$.

Proposition 7-(i) shows that the classification rule with $f_*(x)$ is equivalent to the Bayes
rule. In the special case that $u = 1$, we have from Proposition 7-(ii) that $\lim_{\rho \to 0} f_*(x) =
\mathrm{sign}(\eta(x) - 1/2)$. This implies that the current $f_*(x)$ approaches the solution of $E((1 -
Y f(X))_+ \mid X = x)$, which corresponds to the conventional SVM method (see Proposition 1).

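We now treat $\tilde{\eta}(f)$ as a function of $\rho$. The following proposition is easily proven.

Proposition 8 Let $\tilde{\eta}(f)$ be defined by (10). Then, for fixed $f \in \mathbb{R}$ and $u > 0$, $\lim_{\rho \to \infty} \tilde{\eta}(f) =
\frac{1}{2}$ and

$$\lim_{\rho \to 0} \tilde{\eta}(f) = \begin{cases} 1 & \text{if } f > u, \\ \frac{2}{3} & \text{if } f = u, \\ \frac{1}{2} & \text{if } -u < f < u, \\ \frac{1}{3} & \text{if } f = -u, \\ 0 & \text{if } f < -u. \end{cases}$$
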

As discussed in Section 3.1, $V_{\rho,u}(z)$ is obtained when setting $w = \exp((u - z)/\rho) / (1 +
\exp((u - z)/\rho))$ by using the maximum entropy principle. Let $z = y f(x)$. We further write
$w$ as $w_1(f) = 1/[1 + \exp((f - u)/\rho)]$ when $y = 1$ and as $w_2(f) = 1/[1 + \exp(-(f + u)/\rho)]$
when $y = -1$.

We now explore the relationship of $\tilde{\eta}(f)$ with $w_1(f)$ and $w_2(f)$. Interestingly, we first
find that

$$\tilde{\eta}(f) = \frac{w_2(f)}{w_1(f) + w_2(f)}.$$

It is easily proven that $w_1(f) + w_2(f) \ge 1$ with equality if and only if $u = 0$. We thus
have that $\tilde{\eta}(f) \le w_2(f)$, with equality if and only if $u = 0$; that is, the loss becomes the logit
function $V_{\rho,0}(z)$. Note that $w_2(f)$ represents the probability of the event $\{u + f > 0\}$
and $\tilde{\eta}(f)$ represents the probability of the event $\{f > 0\}$. Since the event $\{f > 0\}$ is a
subset of the event $\{u + f > 0\}$, we have $\tilde{\eta}(f) \le w_2(f)$. Furthermore, the statement that
$\tilde{\eta}(f) = w_2(f)$ if and only if $u = 0$ is equivalent to $\{u + f > 0\} = \{f > 0\}$ if and only if
$u = 0$. This implies that only the logit loss induces $\tilde{\eta}(f) = w_2(f)$.
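The identity $\tilde{\eta}(f) = w_2(f) / (w_1(f) + w_2(f))$ also gives a convenient way to evaluate the probability estimate without overflow, since $w_1$ and $w_2$ are ordinary sigmoids. The sketch below, with hypothetical function names and example values, checks the identity against the direct form $\frac{1 + e^{(f-u)/\rho}}{1 + e^{-(u+f)/\rho} + 1 + e^{(f-u)/\rho}}$ at a moderate temperature and then evaluates $\tilde{\eta}$ at a very small and a very large $\rho$ to illustrate the limits stated in Proposition 8.

```python
import math

def sigmoid(a):
    """Numerically stable 1 / (1 + exp(-a))."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def w1(f, rho, u):
    """w_1(f) = 1 / (1 + exp((f - u) / rho))."""
    return sigmoid((u - f) / rho)

def w2(f, rho, u):
    """w_2(f) = 1 / (1 + exp(-(f + u) / rho))."""
    return sigmoid((f + u) / rho)

def eta_tilde(f, rho, u):
    """Probability estimate written as w_2 / (w_1 + w_2)."""
    return w2(f, rho, u) / (w1(f, rho, u) + w2(f, rho, u))

def eta_tilde_direct(f, rho, u):
    """Direct exponential form; may overflow for very small rho."""
    a = 1.0 + math.exp((f - u) / rho)
    b = 1.0 + math.exp(-(u + f) / rho)
    return a / (a + b)

u = 1.0
print(eta_tilde(0.4, 0.5, u), eta_tilde_direct(0.4, 0.5, u))

# Near-zero temperature: the outputs collapse onto five values.
near_zero = [eta_tilde(f, 1e-3, u) for f in (2.0, 1.0, 0.0, -1.0, -2.0)]
print(near_zero)

# Very large temperature: every output approaches 1/2.
print(eta_tilde(3.0, 1e6, u))
```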

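As discussed in Section 2.1, $\tilde{\eta}(x)$ can be regarded as a reasonable estimate of the
true class probability $\eta(x)$. Recall that $\Delta R(\eta, f) = R(\eta, f) - R(\eta, f_*(\eta))$ and $\Delta R_f =
E_X[\Delta R(\eta(X), f(X))]$ such that $\Delta R_f$ can be viewed as the expected distance between $\tilde{\eta}(x)$
and $\eta(x)$.

For an arbitrary fixed $f \in \mathbb{R}$, we have

$$\Delta R(\eta, f) = R(\eta, f) - R(\eta, f_*(\eta)) = \eta \rho \log \frac{1 + \exp\frac{u - f}{\rho}}{1 + \exp\frac{u - f_*(\eta)}{\rho}} + (1 - \eta) \rho \log \frac{1 + \exp\frac{u + f}{\rho}}{1 + \exp\frac{u + f_*(\eta)}{\rho}}.$$

The first-order derivative of $\Delta R(\eta, f)$ with respect to $\eta$ is

$$\frac{d \Delta R(\eta, f)}{d\eta} = \rho \log \frac{1 + \exp\frac{u - f}{\rho}}{1 + \exp\frac{u - f_*(\eta)}{\rho}} - \rho \log \frac{1 + \exp\frac{u + f}{\rho}}{1 + \exp\frac{u + f_*(\eta)}{\rho}}.$$

The Karush-Kuhn-Tucker (KKT) condition for the minimization problem is as follows:

$$-\eta \frac{\exp\frac{u - f_*(\eta)}{\rho}}{1 + \exp\frac{u - f_*(\eta)}{\rho}} + (1 - \eta) \frac{\exp\frac{u + f_*(\eta)}{\rho}}{1 + \exp\frac{u + f_*(\eta)}{\rho}} = 0,$$

and the second-order derivative of $\Delta R(\eta, f)$ with respect to $\eta$ is given by

$$\frac{d^2 \Delta R(\eta, f)}{d\eta^2} = \Big( \frac{1}{1 + \exp(-\frac{u - f_*(\eta)}{\rho})} + \frac{1}{1 + \exp(-\frac{u + f_*(\eta)}{\rho})} \Big) f_*'(\eta) = \big[ w_1(f_*(\eta)) + w_2(f_*(\eta)) \big] f_*'(\eta).$$

According to Proposition 7-(iii) and using $w_1(f_*(\eta)) + w_2(f_*(\eta)) \ge 1$, we have

$$\frac{d^2 \Delta R(\eta, f)}{d\eta^2} \ge \frac{\rho}{\eta(1 - \eta)},$$

with equality if and only if $u = 0$. This implies $\frac{d^2 \Delta R(\eta, f)}{d\eta^2} > 0$. Thus, for a fixed $f$, $\Delta R(\eta, f)$
is strictly convex in $\eta$. Subsequently, we have that $\Delta R(\eta, f) \ge 0$ with equality $\eta = \tilde{\eta}$, or
equivalently, $f = f_*$.

Using the Taylor expansion of $\Delta R(\eta, f)$ at $\tilde{\eta} := \tilde{\eta}(f) = f_*^{-1}(f)$, we thus obtain a lower
bound for $\Delta R(\eta, f)$; namely,

$$\Delta R(\eta, f) = \Delta R(\tilde{\eta}, f) + \frac{d \Delta R(\tilde{\eta}, f)}{d\eta}(\eta - \tilde{\eta}) + \frac{1}{2} \frac{d^2 \Delta R(\bar{\eta}, f)}{d\eta^2}(\eta - \tilde{\eta})^2$$
$$= \frac{1}{2} \frac{d^2 \Delta R(\bar{\eta}, f)}{d\eta^2}(\eta - \tilde{\eta})^2 \ge \frac{\rho}{2\bar{\eta}(1 - \bar{\eta})}(\eta - \tilde{\eta})^2 \ge 2\rho(\eta - \tilde{\eta})^2,$$

where $\bar{\eta} \in (\tilde{\eta}, \eta) \subset [0, 1]$. In particular, we have that $\Delta R(\eta, 0) \ge 2\rho(\eta - 0.5)^2$. According
to Theorem 2.1 and Corollary 3.1 in Zhang (2004), the following theorem is immediately
established.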

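Theorem 9 Let $\epsilon_1 = \inf_{f(\cdot) \in \mathcal{F}} E_X[\Delta R(\eta(X), f(X))]$, and let $f_*(x) \in \mathcal{F}$ such that

$$E_X[R(\eta(X), f_*(X))] \le \inf_{f(\cdot) \in \mathcal{F}} E_X[R(\eta(X), f(X))] + \epsilon_2$$

for $\epsilon_2 \ge 0$. Then for $\epsilon = \epsilon_1 + \epsilon_2$,

$$\Delta R_{f_*} = E_X[\Delta R(\eta(X), f_*(X))] \le \epsilon$$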



and 




3.4 Analysis 

For notational simplicity, we will use $C_\rho(z)$ for $C_{\rho,1}(z)$. Considering $f(x) = \alpha + \beta' k$, we
define a regularized optimization problem of the form

$$\min_{\alpha, \beta} \; \frac{1}{n} \sum_{i=1}^{n} C_\rho(y_i f(x_i)) + \frac{\gamma_n}{2} \beta' K \beta. \tag{11}$$

Here we assume that the regularization parameter $\gamma$ relies on the number $n$ of training data
points, thus we denote it by $\gamma_n$.

Since the optimization problem (11) is convex with respect to $\alpha$ and $\beta$, the solution
exists and is unique. Moreover, since $C_\rho$ is infinitely smooth, we can resort to the Newton-
Raphson method to solve (11).

Proposition 10 Assume that $\gamma_n$ in (11) and $\gamma$ in (4) are the same. Then the minimizer of
(11) approaches the minimizer of (4) as $\rho \to 0$.

This proposition is obtained directly from Proposition 5. For a fixed $\rho$, we are also
concerned with the universal consistency of the classifier based on (11) with and without
the offset term $\alpha$.

Theorem 11 Let $K(\cdot, \cdot)$ be a universal kernel on $\mathcal{X} \times \mathcal{X}$. Suppose we are given such a
positive sequence $\{\gamma_n\}$ that $\gamma_n \to 0$. If

$$n \gamma_n^2 / \log n \to \infty,$$

then the classifier based on (11) is strongly universally consistent. If


then the classifier based on (11) with $\alpha = 0$ is universally consistent.

3.5 Class Probability Estimation of SVM Outputs

As discussed earlier, the limit of the coherence function, $V_{\rho,1}(y f(x))$, at $\rho = 0$ is just the
hinge loss. Moreover, Proposition 7 shows that the minimizer of $V_{\rho,1}(f)\eta + V_{\rho,1}(-f)(1 - \eta)$
approaches that of $H(f)\eta + H(-f)(1 - \eta)$ as $\rho \to 0$. Thus, Theorem 6 provides us with an
approach to the estimation of the class probability for the conventional SVM.

In particular, let $\hat{f}(x)$ be the solution of the optimization problem (4) for the conventional
SVM. In terms of Theorem 6, we suggest that the estimated class probability $\tilde{\eta}(x)$ is
defined as

$$\tilde{\eta}(x) = \frac{1 + \exp(\frac{\hat{f}(x) - 1}{\rho})}{1 + \exp(-\frac{1 + \hat{f}(x)}{\rho}) + 1 + \exp(\frac{\hat{f}(x) - 1}{\rho})}. \tag{12}$$

Proposition 7 would seem to motivate setting $\rho$ to a very small value in (12). However, as
shown in Proposition 8, the probabilistic outputs degenerate to 0, 1/3, 1/2, 2/3 and 1 in
this case. Additionally, the classification function $\hat{f}(x) = \hat{\alpha} + \sum_{i=1}^{n} \hat{\beta}_i K(x, x_i)$ is obtained
via fitting a conventional SVM model on the training data. Thus, rather than attempting to
specify a fixed value of $\rho$ via a theoretical argument, we instead view it as a hyperparameter
to be fit empirically.

In particular, we fit $\rho$ by minimizing the generalized Kullback-Leibler divergence (or
cross-entropy error) between $\tilde{\eta}(X)$ and $\eta(X)$, which is given by

$$\mathrm{GKL}(\eta, \tilde{\eta}) = E_X \Big[ \eta(X) \log \frac{\eta(X)}{\tilde{\eta}(X)} + (1 - \eta(X)) \log \frac{1 - \eta(X)}{1 - \tilde{\eta}(X)} \Big].$$

Alternatively, we formulate the optimization problem for obtaining $\rho$ as

$$\min_{\rho > 0} \; \mathrm{EKL}(\tilde{\eta}) := -\frac{1}{n} \sum_{i=1}^{n} \Big\{ \frac{1}{2}(y_i + 1) \log \tilde{\eta}(x_i) + \frac{1}{2}(1 - y_i) \log(1 - \tilde{\eta}(x_i)) \Big\}. \tag{13}$$

The problem can be solved by the Newton method. In summary, one first obtains $\hat{f}(x) =
\hat{\alpha} + \sum_{i=1}^{n} \hat{\beta}_i K(x, x_i)$ via the conventional SVM model, and estimates $\rho$ via the optimization
problem in (13) based on the training data; one then uses the formula in (12) to estimate
the class probabilities for the training samples as well as the test samples.
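The procedure summarized above can be sketched in a few lines once SVM decision values are available. The example below assumes hypothetical decision values and labels, evaluates the probability estimate with $u = 1$ through the equivalent form $w_2 / (w_1 + w_2)$ for numerical stability, and selects $\rho$ by a simple search over a log-spaced grid. The grid search is a swapped-in simplification for illustration; the text solves the problem with the Newton method. All function names and the candidate grid are assumptions.

```python
import math

def sigmoid(a):
    """Numerically stable 1 / (1 + exp(-a))."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def eta_tilde(f, rho, u=1.0):
    """Class probability estimate for a decision value f, written as w2 / (w1 + w2)."""
    w1 = sigmoid((u - f) / rho)
    w2 = sigmoid((u + f) / rho)
    return w2 / (w1 + w2)

def ekl(scores, labels, rho, eps=1e-12):
    """Empirical cross-entropy of the estimates for labels in {-1, +1}."""
    total = 0.0
    for f, y in zip(scores, labels):
        p = min(max(eta_tilde(f, rho), eps), 1.0 - eps)  # guard against log(0)
        total -= 0.5 * (y + 1) * math.log(p) + 0.5 * (1 - y) * math.log(1.0 - p)
    return total / len(scores)

def fit_rho(scores, labels, candidates):
    """Pick the candidate temperature with the smallest empirical cross-entropy."""
    return min(candidates, key=lambda rho: ekl(scores, labels, rho))

# Hypothetical SVM decision values on training data (not from the paper).
scores = [-2.1, -1.3, -0.8, -0.4, 0.2, -0.1, 0.6, 0.9, 1.4, 2.2]
labels = [-1, -1, -1, -1, -1, 1, 1, 1, 1, 1]
candidates = [10 ** (k / 10.0) for k in range(-20, 11)]   # 0.01 ... 10

best_rho = fit_rho(scores, labels, candidates)
probabilities = [eta_tilde(f, best_rho) for f in scores]
print(best_rho, ekl(scores, labels, best_rho))
print([round(p, 3) for p in probabilities])
```

The same fitted `best_rho` would then be applied to test-set decision values through `eta_tilde`.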



4. C-Learning 

Focusing on the relationships of the C-loss $C_\rho(y f(x))$ (i.e., $C_{\rho,1}(y f(x))$) with the hinge and
logit losses, we illustrate its application in the construction of large-margin classifiers. Since
$C_\rho(y f(x))$ is smooth, it does not tend to yield a sparse classifier. However, we can employ
a sparsification penalty $J(h)$ to arrive at sparseness. We use the elastic-net penalty of Zou
and Hastie (2005) for the experiments in this section. Additionally, we study two forms of
$f(x)$: kernel expansion and feature expansion. Built on these two expansions, sparseness
can subserve the selection of support vectors and the selection of features, respectively. The
resulting classifiers are called C-learning.



4.1 The Kernel Expansion 

In the kernel expansion approach, given a reproducing kernel $K(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, we define
the kernel expansion as $f(x) = \alpha + \sum_{i=1}^{n} \beta_i K(x_i, x)$ and solve the following optimization
problem:

$$\min_{\alpha, \beta} \; \frac{1}{n} \sum_{i=1}^{n} C_\rho(y_i f(x_i)) + \gamma \Big( (1 - \omega) \frac{1}{2} \beta' K \beta + \omega \|\beta\|_1 \Big), \tag{14}$$

where $K = [K(x_i, x_j)]$ is the $n \times n$ kernel matrix.

It is worth pointing out that the current penalty is slightly different from the conventional
elastic-net penalty, which is $(1 - \omega) \frac{1}{2} \beta'\beta + \omega \|\beta\|_1$. In fact, the optimization
problem (14) can be viewed equivalently as the optimization problem

$$\min_{\alpha, \beta} \; \frac{1}{n} \sum_{i=1}^{n} C_\rho(y_i f(x_i)) + \frac{\gamma}{2} \beta' K \beta \tag{15}$$

under the $\ell_1$ penalty $\|\beta\|_1$. Thus, the method derived from (14) enjoys the generalization
ability of the conventional kernel supervised learning method derived from (15) but also the
sparsity of the $\ell_1$ penalty.

Recently, Friedman et al. (2010) devised a pathwise coordinate descent algorithm for
regularized logistic regression problems in which the elastic-net penalty is used. In order
to solve the optimization problem in (14), we employ this pathwise coordinate descent
algorithm.

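Let the current estimates of $\alpha$ and $\boldsymbol{\beta}$ be $\hat{\alpha}$ and $\hat{\boldsymbol{\beta}}$. We first form a quadratic approximation
to $\frac{1}{n}\sum_{i=1}^n C_\rho(y_i f(\mathbf{x}_i))$, which is

$$Q(\alpha, \boldsymbol{\beta}) = \frac{1}{2n\rho}\sum_{i=1}^n q(\mathbf{x}_i)(1 - q(\mathbf{x}_i))\big(\alpha + \mathbf{k}_i'\boldsymbol{\beta} - z_i\big)^2 + \mathrm{Const}, \tag{16}$$

where

$$z_i = \hat{\alpha} + \mathbf{k}_i'\hat{\boldsymbol{\beta}} + \frac{\rho}{y_i(1 - q(\mathbf{x}_i))},$$

$$q(\mathbf{x}_i) = \frac{\exp[(1 - y_i(\hat{\alpha} + \mathbf{k}_i'\hat{\boldsymbol{\beta}}))/\rho]}{1 + \exp[(1 - y_i(\hat{\alpha} + \mathbf{k}_i'\hat{\boldsymbol{\beta}}))/\rho]},$$

$$\mathbf{k}_i = (K(\mathbf{x}_1,\mathbf{x}_i), \ldots, K(\mathbf{x}_n,\mathbf{x}_i))'.$$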
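
The quantities entering (16) can be computed directly from the current estimates. The sketch below is one possible implementation under the assumption that $\mathbf{K}$ is the symmetric kernel matrix, so that `K @ beta_hat` stacks the values $\mathbf{k}_i'\hat{\boldsymbol{\beta}}$. The name `working_response` is hypothetical.

```python
import numpy as np

def working_response(alpha_hat, beta_hat, K, y, rho=1.0):
    """Return q(x_i), the weights q(1-q), and the working responses z_i of (16)."""
    f_hat = alpha_hat + K @ beta_hat           # current fit at the training points
    t = (1.0 - y * f_hat) / rho
    q = np.exp(-np.logaddexp(0.0, -t))         # exp(t) / (1 + exp(t)), computed stably
    w = q * (1.0 - q)                          # weights of the least-squares problem
    z = f_hat + rho / (y * (1.0 - q))          # working response z_i
    return q, w, z
```

A quick consistency check is that the slope of the quadratic term at the current fit, $q(1-q)(\hat{f}_i - z_i)/\rho$, equals $-y_i q(\mathbf{x}_i)$, which is the slope of $\rho\log[1+\exp((1-y_i f)/\rho)]$ in $f$ at the same point.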

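We then employ coordinate descent to solve the weighted least-squares problem as follows:

$$\min_{\alpha,\boldsymbol{\beta}} \; G(\alpha,\boldsymbol{\beta}) := Q(\alpha,\boldsymbol{\beta}) + \gamma\Big(\frac{1-\omega}{2}\boldsymbol{\beta}'\mathbf{K}\boldsymbol{\beta} + \omega\|\boldsymbol{\beta}\|_1\Big). \tag{17}$$

Assume that we have estimated $\hat{\boldsymbol{\beta}}$ for $\boldsymbol{\beta}$ using $G(\alpha, \boldsymbol{\beta})$. We now set $\frac{\partial G(\alpha, \hat{\boldsymbol{\beta}})}{\partial \alpha} = 0$ to find the
new estimate of $\alpha$:

$$\hat{\alpha} = \frac{\sum_{i=1}^n q(\mathbf{x}_i)(1 - q(\mathbf{x}_i))\big(z_i - \mathbf{k}_i'\hat{\boldsymbol{\beta}}\big)}{\sum_{i=1}^n q(\mathbf{x}_i)(1 - q(\mathbf{x}_i))}. \tag{18}$$

On the other hand, assume that we have estimated $\hat{\alpha}$ for $\alpha$ and $\hat{\beta}_l$ for $\beta_l$ ($l = 1, \ldots, n$, $l \neq j$).
We now optimize $\beta_j$. In particular, we only consider the gradient at $\beta_j \neq 0$. If $\beta_j > 0$,
we have

$$\frac{\partial G(\hat{\alpha}, \boldsymbol{\beta})}{\partial \beta_j} = \frac{1}{n\rho}\sum_{i=1}^n K_{ij}\, q(\mathbf{x}_i)(1 - q(\mathbf{x}_i))\big(\hat{\alpha} + \mathbf{k}_i'\boldsymbol{\beta} - z_i\big) + \gamma(1-\omega)\big(K_{jj}\beta_j + \mathbf{k}_j'\tilde{\boldsymbol{\beta}}\big) + \gamma\omega,$$

and, hence,

$$\hat{\beta}_j = \frac{S\big(t - \gamma(1-\omega)\mathbf{k}_j'\tilde{\boldsymbol{\beta}},\; \gamma\omega\big)}{\frac{1}{n\rho}\sum_{i=1}^n K_{ij}^2\, q(\mathbf{x}_i)(1 - q(\mathbf{x}_i)) + \gamma(1-\omega)K_{jj}}, \tag{19}$$

where $t = \frac{1}{n\rho}\sum_{i=1}^n K_{ij}\, q(\mathbf{x}_i)(1 - q(\mathbf{x}_i))\big(z_i - \hat{\alpha} - \mathbf{k}_i'\tilde{\boldsymbol{\beta}}\big)$, $\tilde{\boldsymbol{\beta}} = (\hat{\beta}_1, \ldots, \hat{\beta}_{j-1}, 0, \hat{\beta}_{j+1}, \ldots, \hat{\beta}_n)'$,
$K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$, and $S(\mu, \nu)$ is the soft-thresholding operator:

$$S(\mu, \nu) = \mathrm{sign}(\mu)(|\mu| - \nu)_+ = \begin{cases} \mu - \nu & \text{if } \mu > 0 \text{ and } \nu < |\mu|, \\ \mu + \nu & \text{if } \mu < 0 \text{ and } \nu < |\mu|, \\ 0 & \text{if } \nu \geq |\mu|. \end{cases}$$

Algorithm 1 summarizes the coordinate descent algorithm for binary C-learning.

Algorithm 1 The coordinate descent algorithm for binary C-learning

  Input: $\mathcal{T} = \{\mathbf{x}_i, y_i\}_{i=1}^n$, $\gamma$, $\omega$, $\epsilon_m$, $\epsilon_i$, $\rho$;
  Initialize: $\hat{\alpha} = \alpha_0$, $\hat{\boldsymbol{\beta}} = \boldsymbol{\beta}_0$
  repeat
    Calculate $G(\alpha, \boldsymbol{\beta})$ using (17);
    $\alpha^* \leftarrow \hat{\alpha}$; $\boldsymbol{\beta}^* \leftarrow \hat{\boldsymbol{\beta}}$;
    repeat
      $\bar{\alpha} \leftarrow \hat{\alpha}$; $\bar{\boldsymbol{\beta}} \leftarrow \hat{\boldsymbol{\beta}}$;
      Calculate $\hat{\alpha}$ using (18);
      for $j = 1$ to $n$ do
        Calculate $\hat{\beta}_j$ using (19);
      end for
    until $\|\hat{\alpha} - \bar{\alpha}\| + \|\hat{\boldsymbol{\beta}} - \bar{\boldsymbol{\beta}}\| < \epsilon_i$
  until $\|\hat{\alpha} - \alpha^*\| + \|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^*\| < \epsilon_m$
  Output: $\hat{\alpha}$, $\hat{\boldsymbol{\beta}}$, and $f(\mathbf{x}) = \hat{\alpha} + \sum_{i=1}^n K(\mathbf{x}_i, \mathbf{x})\hat{\beta}_i$.
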
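
The two nested loops of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the function name `c_learning_kernel`, zero initial values, the iteration caps `max_outer` and `max_inner`, the default tolerances, and the clipping of $q(\mathbf{x}_i)$ away from 0 and 1 (added only to keep the working response finite). It performs no pathwise search over $\gamma$.

```python
import numpy as np

def soft_threshold(mu, nu):
    """S(mu, nu) = sign(mu) * (|mu| - nu)_+ ."""
    return np.sign(mu) * max(abs(mu) - nu, 0.0)

def c_learning_kernel(K, y, gamma, omega, rho=1.0, eps_m=1e-5, eps_i=1e-5,
                      max_outer=50, max_inner=200):
    n = len(y)
    alpha, beta = 0.0, np.zeros(n)             # assumed initial values
    for _ in range(max_outer):
        alpha_star, beta_star = alpha, beta.copy()
        # Quadratic approximation (16) around the current estimate.
        f_hat = alpha + K @ beta
        q = np.exp(-np.logaddexp(0.0, -(1.0 - y * f_hat) / rho))
        q = np.clip(q, 1e-12, 1.0 - 1e-12)     # numerical safeguard (assumption)
        w = q * (1.0 - q)
        z = f_hat + rho / (y * (1.0 - q))
        for _ in range(max_inner):
            alpha_bar, beta_bar = alpha, beta.copy()
            # Update (18) for the intercept.
            alpha = np.sum(w * (z - K @ beta)) / np.sum(w)
            # Update (19) for each coefficient in turn.
            for j in range(n):
                beta_tilde = beta.copy()
                beta_tilde[j] = 0.0
                resid = z - alpha - K @ beta_tilde
                t = np.sum(K[:, j] * w * resid) / (n * rho)
                num = soft_threshold(
                    t - gamma * (1.0 - omega) * (K[:, j] @ beta_tilde),
                    gamma * omega)
                den = (np.sum(K[:, j] ** 2 * w) / (n * rho)
                       + gamma * (1.0 - omega) * K[j, j])
                beta[j] = num / den
            if abs(alpha - alpha_bar) + np.linalg.norm(beta - beta_bar) < eps_i:
                break
        if abs(alpha - alpha_star) + np.linalg.norm(beta - beta_star) < eps_m:
            break
    return alpha, beta
```

The fitted function is then evaluated as `alpha + K_new @ beta`, where `K_new` holds the kernel values between new points (rows) and training points (columns). Recomputing `K @ beta_tilde` for every coordinate costs $O(n^2)$ per update; a practical implementation would keep a running residual instead.
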
4.2 The Linear Feature Expansion 

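In the linear feature expansion approach, we let $f(\mathbf{x}) = a + \mathbf{x}'\mathbf{b}$, and pose the following
optimization problem:

$$\min_{a,\mathbf{b}} \; \frac{1}{n}\sum_{i=1}^n C_\rho(y_i f(\mathbf{x}_i)) + \gamma J_\omega(\mathbf{b}), \tag{20}$$

where for $\omega \in [0, 1]$

$$J_\omega(\mathbf{b}) = \frac{1-\omega}{2}\|\mathbf{b}\|_2^2 + \omega\|\mathbf{b}\|_1 = \sum_{j=1}^d \Big[\frac{1}{2}(1-\omega)b_j^2 + \omega|b_j|\Big].$$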

The elastic-net penalty maintains the sparsity of the $\ell_1$ penalty, but the number of variables
to be selected is no longer bounded by $n$. Moreover, this penalty tends to generate similar
coefficients for highly-correlated variables. We also use a coordinate descent algorithm
to solve the optimization problem (20). The algorithm is similar to that for the kernel 
expansion and the details are omitted here. 
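
For reference, the elastic-net penalty of (20) is simple to evaluate. The short Python sketch below uses the hypothetical name `elastic_net_penalty` and only illustrates how $\omega$ interpolates between the squared $\ell_2$ norm and the $\ell_1$ norm.

```python
import numpy as np

def elastic_net_penalty(b, omega):
    """(1 - omega)/2 * ||b||_2^2 + omega * ||b||_1 for omega in [0, 1]."""
    b = np.asarray(b, dtype=float)
    return 0.5 * (1.0 - omega) * np.sum(b ** 2) + omega * np.sum(np.abs(b))
```

With `omega=1` the penalty reduces to the lasso penalty and with `omega=0` to half the ridge penalty.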

5. Experimental Results 

In Section 5.1 we report the results of experimental evaluations of our method for class 
probability estimation of the conventional SVM given in Section 3.5. In Section 5.2 we 
present results for the C-learning method given in Section 4. 

5.1 Simulation for Class Probability Estimation of SVM Outputs 

We validate our estimation method for the class probability of SVM outputs ("Ours for
SVM"), comparing it with several alternatives: Platt's method (Platt, 1999), Sollich's
method (Sollich, 2002), and the method of Wang et al. (2008) (WSL's). Since penalized (or
regularized) logistic regression (PLR) and C-learning can directly calculate class probability,
we also implement them. In particular, the class probability of C-learning outputs is based on
(10) where we set $\rho = 1$ and $u = 1$ since C-learning itself employs the same setting.

We conducted our analysis over two simulation datasets which were used by Wang et al.
(2008). The first simulation dataset, $\{(x_{i1}, x_{i2}; y_i)\}_{i=1}^{1000}$, was generated as follows. The
$\{(x_{i1}, x_{i2})\}_{i=1}^{1000}$ were uniformly sampled from a unit disk $\{(x_1, x_2) : x_1^2 + x_2^2 \leq 1\}$. Next, we
set $y_i = 1$ if $x_{i1} > 0$ and $y_i = -1$ otherwise, $i = 1, \ldots, 1000$. Finally, we randomly chose 20%
of the samples and flipped their labels. Thus, the true class probability $\eta(Y_i = 1 | x_{i1}, x_{i2})$
was either 0.8 or 0.2.
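
The construction of the first dataset can be sketched in Python as follows. The function name `make_disk_data`, the seed, and the polar-coordinate sampling scheme are illustrative assumptions; the text only specifies uniform sampling on the unit disk and the flipping of 20% of the labels.

```python
import numpy as np

def make_disk_data(n=1000, flip=0.2, seed=0):
    rng = np.random.default_rng(seed)
    # Uniform sampling on the unit disk: radius sqrt(U), angle uniform on [0, 2*pi).
    r = np.sqrt(rng.uniform(size=n))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
    y = np.where(X[:, 0] > 0, 1, -1)           # noiseless labels
    idx = rng.choice(n, size=int(flip * n), replace=False)
    y[idx] *= -1                                # flip the chosen fraction of labels
    eta = np.where(X[:, 0] > 0, 1.0 - flip, flip)  # P(Y = 1 | x) stated in the text
    return X, y, eta
```
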

The second dataset, $\{(x_{i1}, x_{i2}; y_i)\}_{i=1}^{1000}$, was generated as follows. First, we randomly
assigned 1 or $-1$ to $y_i$ for $i = 1, \ldots, 1000$ with equal probability. Next, we generated $x_{i1}$ from
the uniform distribution over $[0, 2\pi]$, and set $x_{i2} = y_i(\sin(x_{i1}) + \epsilon_i)$ where $\epsilon_i \sim N(\epsilon_i | 1, 0.01)$.
For the data, the true class probability of $Y = 1$ was given by

$$\eta(Y = 1 | x_1, x_2) = \frac{N(x_2 | \sin(x_1)+1, 0.01)}{N(x_2 | \sin(x_1)+1, 0.01) + N(x_2 | -\sin(x_1)-1, 0.01)}.$$
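
A minimal Python sketch of the second dataset and its true class probability is given below. It assumes that the second argument of $N(\cdot | \cdot, 0.01)$ is a variance (so the standard deviation is 0.1); the function names and the seed are hypothetical.

```python
import numpy as np

def make_sine_data(n=1000, var=0.01, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)                 # labels with equal probability
    x1 = rng.uniform(0.0, 2.0 * np.pi, size=n)
    eps = rng.normal(1.0, np.sqrt(var), size=n)     # assumes 0.01 is the variance
    x2 = y * (np.sin(x1) + eps)
    return np.column_stack([x1, x2]), y

def true_prob(x1, x2, var=0.01):
    """P(Y = 1 | x1, x2) as the ratio of the two Gaussian densities."""
    m = np.sin(x1) + 1.0
    log_pos = -(x2 - m) ** 2 / (2.0 * var)          # log density for y = +1 (up to a constant)
    log_neg = -(x2 + m) ** 2 / (2.0 * var)          # log density for y = -1 (same constant)
    # 1 / (1 + exp(log_neg - log_pos)), evaluated without overflow.
    return np.exp(-np.logaddexp(0.0, log_neg - log_pos))
```

Working with log densities avoids the overflow that a direct ratio of very small Gaussian densities could cause far from both class means.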

The simulation followed the same setting as that in Wang et al. (2008). That is, we randomly
selected 100 samples for training and used the remaining 900 samples for testing. We did 100
replications for each dataset. The values of generalized Kullback-Leibler loss (GKL) and
classification error rate (CER) on the test sets were averaged over these 100 simulation
replications. Additionally, we employed a Gaussian RBF kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2/\sigma^2)$
where the parameter $\sigma$ was set as the median distance between the positive and negative
classes. We report GKL and CER as well as the corresponding standard deviations in
Tables 2 and 3, in which the results with the PLR method, the tuned Platt method and the
WSL method are directly cited from Wang et al. (2008).

Note that the results with PLR were averaged only over 66 nondegenerate replications
(Wang et al., 2008). Based on GKL and CER, the performance of C-learning is the
best in these two simulations. With regard to GKL, our method for SVM outperforms the
original and tuned versions of Platt's method as well as the method of Wang et al. (2008).
Since our estimation method is based on the $\eta(\mathbf{x})$ in (12), the CER with this class
probability $\eta(\mathbf{x})$ is identical to that with the conventional SVM. This also applies to Sollich's
method, thus we did not include the CER with this method. However, Table 3 shows that
this does not necessarily hold for Platt's method for SVM probability outputs. In other
words, $\eta(\mathbf{x}) > 1/2$ is not equivalent to $f(\mathbf{x}) > 0$ for Platt's method. In fact, Platt (1999)
used this sigmoid-like function to improve the classification accuracy of the conventional
SVM. As for the method of Wang et al. (2008), which is built on a sequence of weighted
classifiers, the CERs of the method should be different from those of the original SVM.
With regard to CER, the performance of PLR is the worst in most cases.

In addition, Figure 2 plots the estimated values of parameter $\rho$ with respect to the 100
simulation replications in our method for class probability estimation of the original SVM.
For simulation 1, the estimated values of $\rho$ range from 0.3402 to 0.8773, while they range
from 0.1077 to 1.3166 for simulation 2.



Table 2: Values of GKL over the two simulation test sets (standard deviations are shown
in parentheses).

         PLR        Platt's    Tuned Platt  WSL's      Sollich's  Ours for SVM  C-learning
Data 1   0.579      0.582      0.569        0.566      0.566      0.558         0.549
         (±0.0021)  (±0.0035)  (±0.0015)    (±0.0014)  (±0.0021)  (±0.0015)     (±0.0016)
Data 2   0.138      0.163      0.153        0.153      0.155      0.142         0.134
         (±0.0024)  (±0.0018)  (±0.0013)    (±0.0010)  (±0.0017)  (±0.0016)     (±0.0014)


Table 3: Values of CER over the two simulation test sets (standard deviations are shown
in parentheses).

         PLR        Platt's    WSL's      Ours for SVM  C-learning
Data 1   0.258      0.234      0.217      0.219         0.214
         (±0.0053)  (±0.0026)  (±0.0021)  (±0.0021)     (±0.0015)
Data 2   0.075      0.077      0.069      0.065         0.061
         (±0.0018)  (±0.0024)  (±0.0014)  (±0.0015)     (±0.0019)


5.2 The Performance Analysis of C-Learning 

To evaluate the performance of our C-learning method, we further conducted empirical 
studies on several benchmark datasets and compared C-learning with two closely related 
classification methods: the hybrid huberized SVM (HHSVM) of Wang et al. (2007) and 
the regularized logistic regression model (RLRM) of Friedman et al. (2010), both with 
the elastic-net penalty. All three classification methods were implemented in both the
feature and kernel expansion settings.

[Figure 2, two panels: (a) Simulation 1; (b) Simulation 2. Horizontal axis: index of simulation replications.]

Figure 2: The learned values of parameter $\rho$ vs. simulation replications in our estimation
method for the class probability of SVM outputs.



In the experiments we used 11 binary classification datasets. Table 4 gives a summary
of these benchmark datasets. The seven binary datasets of digits were obtained from the
publicly available USPS dataset of handwritten digits as follows. The first six datasets were
generated from the digit pairs {(1, 7), (2, 3), (2, 7), (3, 8), (4, 7), (6, 9)}, and 200 digits were
chosen within each class of each dataset. The USPS (odd vs. even) dataset consisted
of the first 80 images per digit in the USPS training set.

The two binary artificial datasets of "g241c" and "g241d" were generated via the setup 
presented by Chapelle et al. (2006). Each class of these two datasets consisted of 750 
samples. 

The two binary gene datasets of "colon" and "leukemia" were also used in our experi- 
ments. The "colon" dataset, consisting of 40 colon tumor samples and 22 normal colon tissue 
samples with 2,000 dimensions, was obtained by employing an Affymetrix oligonucleotide 
array to analyze more than 6,500 human genes expressed in sequence tags (Alon et al., 
1999). The "leukemia" dataset is of the same type as the "colon" cancer dataset (Golub 
et al., 1999), and it was obtained with respect to two variants of leukemia, i.e., acute myeloid 
leukemia (AML) and acute lymphoblastic leukemia (ALL). It initially contained expression 
levels of 7129 genes taken over 72 samples (AML, 25 samples, or ALL, 47 samples), and 
then it was pre-feature selected, leading to a feature space with 3571 dimensions. 

In our experiments, each dataset was randomly partitioned into two disjoint subsets
as the training and test sets, with the percentage of the training data samples also given in
Table 4. Twenty random partitions were chosen for each dataset, and the average and
standard deviation of their classification error rates over the test data were reported.

Although we can seek an optimum $\rho$ using computationally intensive methods such as
cross-validation, the experiments showed that when $\rho$ takes a value in [0.1, 2], our method
is always able to obtain promising performance. Here our reported results are based on the
setting of $\rho = 1$, due to the relationship of the C-loss $C(z)$ with the hinge loss $(1 - z)_+$ and
the logit loss $\log(1 + \exp(-z))$ (see our analysis in Section 3 and Figure 1).

Table 4: Summary of the benchmark datasets: m: the number of classes; d: the dimension
of the input vector; k: the size of the dataset; n: the number of the training data.

Dataset               m    d      k      n/k
USPS (1 vs. 7)        2    256    400    3%
USPS (2 vs. 3)        2    256    400    3%
USPS (2 vs. 7)        2    256    400    3%
USPS (3 vs. 8)        2    256    400    3%
USPS (4 vs. 7)        2    256    400    3%
USPS (6 vs. 9)        2    256    400    3%
USPS (Odd vs. Even)   2    256    800    3%
g241c                 2    241    1500   10%
g241d                 2    241    1500   10%
colon                 2    2000   62     25.8%
leukemia              2    3571   72     27.8%

Table 5: Classification error rates (%) and standard deviations on the 11 datasets for the
feature expansion setting.

Dataset          HHSVM         RLRM          C-learning
(1 vs. 7)        2.29±1.17     2.06±1.21     1.60±0.93
(2 vs. 3)        8.13±2.02     8.29±2.76     8.32±2.73
(2 vs. 7)        5.82±2.59     6.04±2.60     5.64±2.44
(3 vs. 8)        12.46±2.90    10.77±2.72    11.74±2.83
(4 vs. 7)        7.35±2.89     6.91±2.72     6.68±3.53
(6 vs. 9)        2.32±1.65     2.15±1.43     2.09±1.41
(Odd vs. Even)   20.94±2.02    19.83±2.82    19.74±2.81
g241c            22.30±1.30    21.38±1.12    21.34±1.11
g241d            24.32±1.53    23.81±1.65    23.85±1.69
colon            14.57±1.86    14.47±2.02    12.34±1.48
leukemia         4.06±2.31     4.43±1.65     3.21±1.08


As for the parameters $\gamma$ and $\omega$, they were selected by cross-validation for all the
classification methods. In the kernel expansion, the RBF kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2/\sigma^2)$
was employed, and $\sigma$ was set to the mean Euclidean distance among the input samples. For
C-learning, the other parameters were set as follows: = = 10~^. 

Tables 5 and 6 show the test results corresponding to the linear feature expansion
and RBF kernel expansion, respectively. From the tables, we can see that the overall
performance of C-learning is slightly better than that of the two competing methods in both
the feature and kernel settings.

Table 6: Classification error rates (%) and standard deviations on the 11 datasets for the
RBF kernel setting.

Dataset          HHSVM          RLRM          C-learning
(1 vs. 7)        1.73±1.64      1.39±0.64     1.37±0.65
(2 vs. 3)        8.55±3.36      8.45±3.38     8.00±3.32
(2 vs. 7)        5.09±2.10      4.02±1.81     3.90±1.79
(3 vs. 8)        12.09±3.78     10.58±3.50    10.36±3.52
(4 vs. 7)        6.74±3.39      6.92±3.37     6.55±3.28
(6 vs. 9)        2.12±0.91      1.74±1.04     1.65±0.99
(Odd vs. Even)   28.38±10.51    26.92±6.52    26.29±6.45
g241c            21.38±1.45     21.55±1.42    21.62±1.35
g241d            25.89±2.15     22.34±1.27    20.37±1.20
colon            14.26±2.66     14.79±2.80    13.94±2.44
leukemia         2.77±0.97      2.74±0.96     2.55±0.92


Figure 3 reveals that the values of the objective functions for the linear feature and
RBF kernel versions in the outer and inner iterations tend to be significantly reduced as
the number of iterations in the coordinate descent procedure increases. Although we report
only the change of the values of the objective function for the dataset USPS (1 vs. 7), similar
results were found on all other datasets. This shows that the coordinate descent algorithm
is very efficient.

We also conducted a systematic study of sparseness from the elastic-net penalty. Indeed,
the elastic-net penalty does give rise to sparse solutions for our C-learning methods.
Moreover, we found that, similar to other methods, the sparseness of the solution is dependent on
the parameters $\gamma$ and $\omega$, which were set to different values for different datasets using
cross-validation.

6. Conclusions 

In this paper we have studied a family of coherence functions and considered the relationship
between coherence functions and hinge functions. In particular, we have established
some important properties of these functions, which lead us to a novel approach for class
probability estimation in the conventional SVM. Moreover, we have proposed large-margin
classification methods using the C-loss function and the elastic-net penalty, and developed
pathwise coordinate descent algorithms for parameter estimation. We have theoretically
established the Fisher-consistency of our classification methods and empirically tested the
classification performance on several benchmark datasets. Our approach establishes an
interesting link between SVMs and logistic regression models due to the relationship of the
C-loss with the hinge and logit losses.

[Figure 3, two panels (a) and (b). Horizontal axis: number of iterations.]

Figure 3: Change of the value of the objective function for C-learning as the number of
iterations in the coordinate descent procedure increases in the linear feature and
RBF kernel cases on the dataset USPS (1 vs. 7): (a) the values of the objective
function (14) in the outer iteration; (b) the objective function values $G(\alpha, \boldsymbol{\beta})$ for
the feature and RBF kernel cases in the inner iteration.



Appendix A. The Proof of Proposition 2 

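First, we have

$$\rho\log 2 + [u-z]_+ - V_{\rho,u}(z) = \rho\log\frac{2\exp\frac{[u-z]_+}{\rho}}{1+\exp\frac{u-z}{\rho}} \geq 0.$$

Second, note that

$$\rho\log 2 + \frac{u-z}{2} - V_{\rho,u}(z) = \rho\log\frac{2\exp\frac{u-z}{2\rho}}{1+\exp\frac{u-z}{\rho}} \leq \rho\log\frac{\exp 0 + \exp\frac{u-z}{\rho}}{1+\exp\frac{u-z}{\rho}} = 0,$$

where we use the fact that $\exp(\cdot)$ is convex, so that $\exp\frac{u-z}{2\rho} \leq \frac{1}{2}\big(\exp 0 + \exp\frac{u-z}{\rho}\big)$.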

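Third, it immediately follows from (i) that $\lim_{\rho\to 0} V_{\rho,u}(z) = [u-z]_+$. Moreover,
it is easily obtained that

$$\lim_{\rho\to\infty} V_{\rho,u}(z) - \rho\log 2 = \lim_{\rho\to\infty}\frac{\log\frac{1+\exp\frac{u-z}{\rho}}{2}}{\frac{1}{\rho}} = \lim_{\alpha\to 0}\frac{\log\frac{1+\exp(\alpha(u-z))}{2}}{\alpha} = \lim_{\alpha\to 0}\frac{\frac{1}{2}[u-z]\exp[\alpha(u-z)]}{\frac{1}{2}\big(1+\exp[\alpha(u-z)]\big)} = \frac{1}{2}(u-z).$$

Since $\log(1 + a) \geq \log(a)$ for $a > 0$, we have

$$\frac{u}{\log[1+\exp(u/\rho)]}\log\Big[1+\exp\frac{u-z}{\rho}\Big] \leq \frac{u}{u/\rho}\log\Big[1+\exp\frac{u-z}{\rho}\Big] = \rho\log\Big[1+\exp\frac{u-z}{\rho}\Big].$$

We now consider that

$$\lim_{\rho\to\infty} C_{\rho,u}(z) = u\lim_{\rho\to\infty}\frac{\log\big[1+\exp\frac{u-z}{\rho}\big]}{\log[1+\exp(u/\rho)]} = u.$$

Finally, since

$$\lim_{\alpha\to\infty}\frac{\log[1+\exp(u\alpha)]}{\alpha u} = \lim_{\alpha\to\infty}\frac{\exp(u\alpha)}{1+\exp(u\alpha)} = 1,$$

we obtain $\lim_{\rho\to 0} C_{\rho,u}(z) = [u-z]_+$.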
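
The inequalities and limits established above can be spot-checked numerically. The Python sketch below is only an illustration on a finite grid and is not part of the proof; the function names, the grid, and the tolerances are assumptions, and $u = 1$ is used throughout.

```python
import numpy as np

def coherence(z, rho, u=1.0):
    """V_{rho,u}(z) = rho * log(1 + exp((u - z)/rho))."""
    return rho * np.logaddexp(0.0, (u - z) / rho)

def c_loss(z, rho, u=1.0):
    """C_{rho,u}(z) = u * log(1 + exp((u - z)/rho)) / log(1 + exp(u/rho))."""
    return u * np.logaddexp(0.0, (u - z) / rho) / np.logaddexp(0.0, u / rho)

def hinge(z, u=1.0):
    return np.maximum(u - z, 0.0)

def bounds_hold(rho, u=1.0, tol=1e-12):
    z = np.linspace(-3.0, 3.0, 121)
    V, C, H = coherence(z, rho, u), c_loss(z, rho, u), hinge(z, u)
    ok = np.all(H <= V + tol) and np.all(V <= rho * np.log(2.0) + H + tol)
    ok = ok and np.all(0.5 * (u - z) <= V - rho * np.log(2.0) + tol)
    ok = ok and np.all(u * (z <= 0) <= C + tol) and np.all(C <= V + tol)
    return bool(ok)

z = np.linspace(-3.0, 3.0, 121)
# Largest gap between the C-loss and the hinge loss at a small temperature.
gap_small_rho = float(np.max(np.abs(c_loss(z, 1e-3) - hinge(z))))
# Largest gap between the C-loss and the constant u = 1 at a large temperature.
gap_large_rho = float(np.max(np.abs(c_loss(z, 1e6) - 1.0)))
```
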

Appendix B. The Proof of Proposition 3 

Before we prove Proposition 3, we establish the following lemma. 

Lemma 12 Assume that $x > 0$, then $f_1(x) = \frac{x}{1+x}\frac{\log x}{\log(1+x)}$ and $f_2(x) = \frac{x}{1+x}\frac{1}{\log(1+x)}$ are
increasing and decreasing, respectively.

Proof The first derivatives of $f_1(x)$ and $f_2(x)$ are

$$f_1'(x) = \frac{1}{(1+x)^2\log^2(1+x)}\big[\log x\log(1+x) + \log(1+x) + x\log(1+x) - x\log x\big],$$

$$f_2'(x) = \frac{1}{(1+x)^2\log^2(1+x)}\big[\log(1+x) - x\big] \leq 0.$$

This implies that $f_2(x)$ is decreasing. If $\log x \geq 0$, we have $x\log(1+x) - x\log x \geq 0$.
Otherwise, if $\log x < 0$, we have $\log x[\log(1+x) - x] \geq 0$. This implies that $f_1'(x) \geq 0$ is always
satisfied. Thus, $f_1(x)$ is increasing. ■
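
The monotonicity stated in Lemma 12 can be illustrated numerically on a grid. The following Python sketch is a sanity check rather than a proof; the grid limits are arbitrary choices.

```python
import numpy as np

def f1(x):
    """f_1(x) = x log(x) / ((1 + x) log(1 + x))."""
    return x * np.log(x) / ((1.0 + x) * np.log1p(x))

def f2(x):
    """f_2(x) = x / ((1 + x) log(1 + x))."""
    return x / ((1.0 + x) * np.log1p(x))

x = np.logspace(-3, 3, 2001)                      # grid of positive values
f1_increasing = bool(np.all(np.diff(f1(x)) > 0))  # successive differences positive
f2_decreasing = bool(np.all(np.diff(f2(x)) < 0))  # successive differences negative
```
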



Let $\alpha = 1/\rho$ and use $h_1(\alpha)$ for $L_{\rho,u}(z)$ to view it as a function of $\alpha$. We now compute
the derivative of $h_1(\alpha)$ w.r.t. $\alpha$:

$$h_1'(\alpha) = \frac{\log[1+\exp(\alpha(u-z))]}{\log[1+\exp(u\alpha)]}\bigg[\frac{\exp(\alpha(u-z))}{1+\exp(\alpha(u-z))}\frac{u-z}{\log[1+\exp(\alpha(u-z))]} - \frac{\exp(\alpha u)}{1+\exp(\alpha u)}\frac{u}{\log[1+\exp(\alpha u)]}\bigg]$$

$$= \frac{\log[1+\exp(\alpha(u-z))]}{\alpha\log[1+\exp(u\alpha)]}\bigg[\frac{\exp(\alpha(u-z))}{1+\exp(\alpha(u-z))}\frac{\log\exp(\alpha(u-z))}{\log[1+\exp(\alpha(u-z))]} - \frac{\exp(\alpha u)}{1+\exp(\alpha u)}\frac{\log\exp(\alpha u)}{\log[1+\exp(\alpha u)]}\bigg].$$

When $z < 0$, we have $\exp(\alpha(u-z)) > \exp(\alpha u)$. It then follows from Lemma 12 that
$h_1'(\alpha) > 0$. When $z \geq 0$, we have $h_1'(\alpha) \leq 0$ due to $\exp(\alpha(u-z)) \leq \exp(\alpha u)$. The proof of
(i) is completed.

To prove part (ii), we regard $L_{\rho,u}(z)$ as a function of $u$ and denote it with $h_2(u)$. The
first derivative $h_2'(u)$ is given by

$$h_2'(u) = \frac{\alpha\log[1+\exp(\alpha(u-z))]}{\log[1+\exp(u\alpha)]}\bigg[\frac{\exp(\alpha(u-z))}{1+\exp(\alpha(u-z))}\frac{1}{\log[1+\exp(\alpha(u-z))]} - \frac{\exp(\alpha u)}{1+\exp(\alpha u)}\frac{1}{\log[1+\exp(\alpha u)]}\bigg].$$

Using Lemma 12, we immediately obtain part (ii).



Appendix C. The Proof of Theorem 6 

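We write the objective function as

$$L(f) = V_{\rho,u}(f)\eta + V_{\rho,u}(-f)(1-\eta) = \rho\log\Big[1+\exp\frac{u-f}{\rho}\Big]\eta + \rho\log\Big[1+\exp\frac{u+f}{\rho}\Big](1-\eta).$$

The first-order and second-order derivatives of $L$ w.r.t. $f$ are given by

$$\frac{dL}{df} = -\eta\frac{\exp\frac{u-f}{\rho}}{1+\exp\frac{u-f}{\rho}} + (1-\eta)\frac{\exp\frac{u+f}{\rho}}{1+\exp\frac{u+f}{\rho}},$$

$$\frac{d^2L}{df^2} = \frac{\eta}{\rho}\frac{\exp\frac{u-f}{\rho}}{1+\exp\frac{u-f}{\rho}}\frac{1}{1+\exp\frac{u-f}{\rho}} + \frac{1-\eta}{\rho}\frac{\exp\frac{u+f}{\rho}}{1+\exp\frac{u+f}{\rho}}\frac{1}{1+\exp\frac{u+f}{\rho}}.$$

Since $\frac{d^2L}{df^2} > 0$, the minimum of $L$ is unique. Moreover, letting $\frac{dL}{df} = 0$ yields (9).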



Appendix D. The Proof of Proposition 7 

First, if $\eta > 1/2$, we have $4\eta(1-\eta) > 4(1-\eta)^2$ and $(2\eta-1)\exp(u/\rho) > 0$. This implies
$f_* > 0$. When $\eta < 1/2$, we have $(1-2\eta)\exp(u/\rho) > 0$. In this case, since

$$(1-2\eta)^2\exp(2u/\rho) + 4\eta(1-\eta) < (1-2\eta)^2\exp(2u/\rho) + 4(1-\eta)^2 + 4(1-\eta)(1-2\eta)\exp(u/\rho),$$

we obtain $f_* < 0$.

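Second, letting $\alpha = 1/\rho$, we express $f_*$ as

$$f_* = \frac{1}{\alpha}\log\frac{(2\eta-1)\exp(u\alpha) + \sqrt{(1-2\eta)^2\exp(2u\alpha) + 4\eta(1-\eta)}}{2(1-\eta)}$$

$$= \frac{1}{\alpha}\log\frac{\frac{2\eta-1}{|2\eta-1|} + \sqrt{1 + \frac{4\eta(1-\eta)}{(1-2\eta)^2\exp(2u\alpha)}}}{2(1-\eta)\exp(-u\alpha)/|2\eta-1|}$$

$$= \frac{1}{\alpha}\log\bigg[\frac{2\eta-1}{|2\eta-1|} + \sqrt{1 + \frac{4\eta(1-\eta)}{(1-2\eta)^2\exp(2u\alpha)}}\bigg] - \frac{1}{\alpha}\log\bigg[\frac{2(1-\eta)}{|2\eta-1|}\bigg] + u.$$

Thus, if $\eta > 1/2$, it is clear that $\lim_{\alpha\to\infty} f_* = u$. In the case that $\eta < 1/2$, we have

$$\lim_{\alpha\to\infty} f_* = u - u\lim_{\alpha\to\infty}\frac{1}{-1 + \sqrt{1 + \frac{4\eta(1-\eta)}{(1-2\eta)^2\exp(2u\alpha)}}}\cdot\frac{1}{\sqrt{1 + \frac{4\eta(1-\eta)}{(1-2\eta)^2\exp(2u\alpha)}}}\cdot\frac{4\eta(1-\eta)}{(1-2\eta)^2}\exp(-2u\alpha)$$

$$= u - \frac{4\eta(1-\eta)u}{(1-2\eta)^2}\lim_{\alpha\to\infty}\frac{\exp(-2u\alpha)}{-1 + \sqrt{1 + \frac{4\eta(1-\eta)}{(1-2\eta)^2\exp(2u\alpha)}}}$$

$$= u - 2u\lim_{\alpha\to\infty}\sqrt{1 + \frac{4\eta(1-\eta)}{(1-2\eta)^2\exp(2u\alpha)}}$$

$$= -u.$$

Here we use l'Hôpital's rule in calculating limits.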
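
The sign and limit properties of $f_*$ derived above can be checked numerically. The Python sketch below evaluates $f_*$ in the rescaled form $u + \rho\log(\cdot)$ used in the second step, which avoids overflow of $\exp(u/\rho)$ for small $\rho$. For $\eta < 1/2$ it relies on the antisymmetry $f_*(1-\eta) = -f_*(\eta)$, which follows from the symmetry of $L(f)$ in Appendix C under the exchange $(\eta, f) \to (1-\eta, -f)$; this shortcut and the function names are assumptions of the sketch, not steps taken from the paper.

```python
import numpy as np

def f_star(eta, rho, u=1.0):
    """Minimizer of L(f) = eta * V(f) + (1 - eta) * V(-f) for 0 < eta < 1."""
    if eta < 0.5:
        # Antisymmetry avoids the cancellation in (2*eta - 1) + sqrt(...).
        return -f_star(1.0 - eta, rho, u)
    d = np.exp(-2.0 * u / rho)
    num = (2.0 * eta - 1.0) + np.sqrt((2.0 * eta - 1.0) ** 2 + 4.0 * eta * (1.0 - eta) * d)
    return u + rho * np.log(num / (2.0 * (1.0 - eta)))

def dL_df(f, eta, rho, u=1.0):
    """First derivative of L(f), as written in Appendix C."""
    s_minus = np.exp(-np.logaddexp(0.0, -(u - f) / rho))   # exp(a) / (1 + exp(a)), a = (u - f)/rho
    s_plus = np.exp(-np.logaddexp(0.0, -(u + f) / rho))    # same with a = (u + f)/rho
    return -eta * s_minus + (1.0 - eta) * s_plus
```

For example, `f_star(0.8, 1e-3)` is close to $u = 1$ and `f_star(0.2, 1e-3)` is close to $-u$, in line with the two limits, while `dL_df(f_star(0.8, 1.0), 0.8, 1.0)` vanishes up to rounding error.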

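Third, let $\alpha = \exp(u/\rho)$. It is then immediately calculated that

$$\frac{\partial f_*}{\partial\eta} = \rho\Bigg[\frac{2\alpha + \frac{2(1-2\eta)(1-\alpha^2)}{\sqrt{(1-2\eta)^2\alpha^2 + 4\eta(1-\eta)}}}{(2\eta-1)\alpha + \sqrt{(1-2\eta)^2\alpha^2 + 4\eta(1-\eta)}} + \frac{1}{1-\eta}\Bigg].$$

Consider that

$$A = \frac{2\alpha + \frac{2(1-2\eta)(1-\alpha^2)}{\sqrt{(1-2\eta)^2\alpha^2 + 4\eta(1-\eta)}}}{(2\eta-1)\alpha + \sqrt{(1-2\eta)^2\alpha^2 + 4\eta(1-\eta)}} - \frac{1}{\eta} = \frac{\alpha - \frac{2\eta + (1-2\eta)\alpha^2}{\sqrt{(1-2\eta)^2\alpha^2 + 4\eta(1-\eta)}}}{\eta(2\eta-1)\alpha + \eta\sqrt{(1-2\eta)^2\alpha^2 + 4\eta(1-\eta)}}.$$

It suffices for $\frac{\partial f_*}{\partial\eta} \geq \frac{\rho}{\eta(1-\eta)}$ to show $A \geq 0$. Note that

$$\frac{(2\eta + (1-2\eta)\alpha^2)^2}{(1-2\eta)^2\alpha^2 + 4\eta(1-\eta)} - \alpha^2 = \frac{4\eta^2(1-\alpha^2)}{(1-2\eta)^2\alpha^2 + 4\eta(1-\eta)} \leq 0$$

due to $\alpha \geq 1$, with equality when and only when $\alpha = 1$ or, equivalently, $u = 0$. Accordingly,
we have $\alpha - \frac{2\eta + (1-2\eta)\alpha^2}{\sqrt{(1-2\eta)^2\alpha^2 + 4\eta(1-\eta)}} \geq 0$.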

Appendix E. The Proof of Theorem 11 

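In order to prove the theorem, we define

$$\delta_\gamma := \sup\{t : \gamma t^2 \leq 2V_\rho(0)\} = \sqrt{2/\gamma}$$

for $\gamma > 0$ and let $V_\rho^{(\gamma)}(yf)$ be the coherence function $V_\rho(yf)$ restricted to $\mathcal{Y} \times [-\delta_\gamma k_{\max}, \delta_\gamma k_{\max}]$,
where $k_{\max} = \max_{\mathbf{x}\in\mathcal{X}} K(\mathbf{x}, \mathbf{x})$. For the Gaussian RBF kernel, we have $k_{\max} = 1$.
It is clear that

$$\|V_\rho^{(\gamma)}\|_\infty := \sup\{V_\rho^{(\gamma)}(yf), (y, f) \in \mathcal{Y} \times [-\delta_\gamma k_{\max}, \delta_\gamma k_{\max}]\} = \rho\log\Big(1 + \exp\frac{u + k_{\max}\sqrt{2/\gamma}}{\rho}\Big).$$

Considering that

$$\lim_{\gamma\to 0}\frac{\|V_\rho^{(\gamma)}\|_\infty}{k_{\max}\sqrt{2/\gamma}} = \lim_{\gamma\to 0}\frac{\exp\frac{u + k_{\max}\sqrt{2/\gamma}}{\rho}}{1 + \exp\frac{u + k_{\max}\sqrt{2/\gamma}}{\rho}} = 1,$$

we have $\lim_{\gamma\to 0}\|V_\rho^{(\gamma)}\|_\infty/\sqrt{1/\gamma} = \sqrt{2}k_{\max}$. Hence, we have $\|V_\rho^{(\gamma)}\|_\infty \sim \sqrt{1/\gamma}$.
On the other hand, since

$$V_\rho^{(\gamma)}(yf) - V_\rho^{(\gamma)}(yf_1) = \frac{\partial V_\rho^{(\gamma)}(yf_2)}{\partial f}(f - f_1),$$

where $f_2 \in [f, f_1] \subseteq [-\delta_\gamma k_{\max}, \delta_\gamma k_{\max}]$, we have

$$|V_\rho^{(\gamma)}|_1 := \sup\bigg\{\frac{|V_\rho^{(\gamma)}(yf) - V_\rho^{(\gamma)}(yf_1)|}{|f - f_1|},\; y \in \mathcal{Y},\; f, f_1 \in [-\delta_\gamma k_{\max}, \delta_\gamma k_{\max}],\; f \neq f_1\bigg\}$$

$$= \sup\bigg\{\Big|\frac{\partial V_\rho^{(\gamma)}(yf_2)}{\partial f}\Big|,\; y \in \mathcal{Y},\; f_2 \in [-\delta_\gamma k_{\max}, \delta_\gamma k_{\max}]\bigg\} = \frac{\exp\frac{u + k_{\max}\sqrt{2/\gamma}}{\rho}}{1 + \exp\frac{u + k_{\max}\sqrt{2/\gamma}}{\rho}}.$$

In this case, we have $\lim_{\gamma\to 0}|V_\rho^{(\gamma)}|_1 = 1$, which implies that $|V_\rho^{(\gamma)}|_1 \sim 1$.

We now immediately conclude Theorem 11 from Corollary 3.19 and Theorem 3.20 of
Steinwart (2005).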